Impact of data fragmentation across healthcare centers on the accuracy of a high-throughput clinical phenotyping algorithm for specifying subjects with type 2 diabetes mellitus
- Wei-Qi Wei1,2,
- Cynthia L Leibson3,
- Jeanine E Ransom2,
- Abel N Kho4,
- Pedro J Caraballo5,
- High Seng Chai2,
- Barbara P Yawn6,
- Jennifer A Pacheco7,
- Christopher G Chute2
- 1Institute for Health Informatics, University of Minnesota, Twin Cities, Minnesota, USA
- 2Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, Minnesota, USA
- 3Division of Epidemiology, Mayo Clinic, Rochester, Minnesota, USA
- 4Divisions of General Internal Medicine and Biomedical Informatics, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- 5Division of General Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA
- 6Department of Research, Olmsted Medical Center, Rochester, Minnesota, USA
- 7Center for Genetic Medicine, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Correspondence to Dr Christopher G Chute, Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First St SW, Rochester, MN 55905, USA
- Received 13 September 2011
- Accepted 21 December 2011
- Published Online First 16 January 2012
Objective To evaluate data fragmentation across healthcare centers with regard to the accuracy of a high-throughput clinical phenotyping (HTCP) algorithm developed to differentiate (1) patients with type 2 diabetes mellitus (T2DM) and (2) patients with no diabetes.
Materials and methods This population-based study identified all Olmsted County, Minnesota residents in 2007. We used provider-linked electronic medical record data from the two healthcare centers that provide >95% of all care to County residents (ie, Olmsted Medical Center and Mayo Clinic in Rochester, Minnesota, USA). Subjects were limited to residents with one or more encounters from January 1, 2006 through December 31, 2007 at each of the two healthcare centers. DM-relevant data on diagnoses, laboratory results, and medication from both centers were obtained for this period. The algorithm was first executed using data from both centers (ie, the gold standard) and then using data from Mayo Clinic alone. Positive predictive values and false-negative rates were calculated, and the McNemar test was used to compare the categorization obtained with Mayo Clinic data alone against the gold standard. Age and sex were compared between true-positive and false-negative subjects with T2DM. Statistical significance was accepted as p<0.05.
Results With data from both medical centers, 765 subjects with T2DM (4256 non-DM subjects) were identified. When single-center data were used, 252 T2DM subjects (1573 non-DM subjects) were missed, and an additional 27 T2DM subjects (215 non-DM subjects) were identified as false positives. The positive predictive values and false-negative rates were 95.0% (513/540) and 32.9% (252/765), respectively, for T2DM subjects and 92.6% (2683/2898) and 37.0% (1573/4256), respectively, for non-DM subjects. Age and sex distribution differed between true-positive (mean age 62.1; 45% female) and false-negative (mean age 65.0; 56.0% female) T2DM subjects.
Conclusion The findings show that application of an HTCP algorithm using data from a single medical center contributes to misclassification. These findings should be considered carefully by researchers when developing and executing HTCP algorithms.
- electronic medical record
- research techniques
- type 2 diabetes mellitus
- EMR secondary and meaningful use
- information retrieval
- data mining
- medical informatics
- infection control
- biomedical informatics
- knowledge representations
- controlled terminologies and vocabularies
- HIT data standards
Background and significance
Subject selection—the process of identifying patients with specific clinical characteristics—is an essential component of clinical studies. Accurate selection consumes considerable time and effort to gather, abstract, and review medical charts, and it is often the rate-limiting step in clinical research.1 Recently, the increased adoption of electronic medical record (EMR) systems has provided researchers with an advanced tool to improve this inefficient process.2 By leveraging the machine-processable content through an EMR system, clinical researchers can develop a high-throughput clinical phenotyping (HTCP) algorithm (a set of inclusion and exclusion criteria for identifying patients with specified characteristics), execute the algorithm against already existing data within an EMR system, and rapidly obtain a large pool of eligible study subjects.3–6
The Electronic Medical Records and Genomics (eMERGE) Network,5 a national consortium funded by the National Human Genome Research Institute, has devoted substantial effort to exploring the possibility of leveraging EMRs as resources for subject selection. The eMERGE I Network consisted of five leading national academic medical centers: Mayo Clinic, Rochester, Minnesota; Northwestern University Medical Center, Chicago, Illinois; Vanderbilt University Medical Center, Nashville, Tennessee; Marshfield Clinic, Marshfield, Wisconsin; and the Group Health Cooperative with the University of Washington, Seattle, Washington. One of its primary goals was to develop HTCP algorithms for identifying subjects suitable for genotype- and phenotype-associated studies. To ensure that an algorithm is transportable and that various institutions can execute it to obtain reliable outputs, each algorithm developed in the eMERGE Network was proposed, reviewed, and validated by domain experts across participating medical centers.
The HTCP approach of leveraging EMR data for subject selection is appealing because it offers increased efficiency while reducing the large amount of manual detail work that is required. We hypothesize that results of HTCP are more accurate if all medical data for every patient are available for review. However, the ability to capture all of a patient's medical data is limited when patients are seen by multiple healthcare centers. A recent study indicates that, of the nearly 3.7 million patients who sought treatment in acute care settings in Massachusetts during a 5-year period, over 30% visited more than one hospital and 1%—or 43 794 patients—visited five or more hospitals during the study period.7 Similar findings on multiple healthcare centers for primary care visits were reported by Smith et al.8 The resultant data fragmentation across healthcare centers leads to incomplete data from any one EMR when researchers execute an HTCP algorithm at a single medical center. The absent data could be crucial in qualifying or disqualifying a study subject and could cause subject selection errors.
Previous studies of the effect of data fragmentation on clinical outcomes suggested that data fragmentation wasted valuable medical resources and could adversely affect treatment outcomes.8–13 Cox and his colleagues14 investigated the influence of missing data and demonstrated that subjects with missing data differed significantly in terms of variables crucial to the study outcome and that distortion led to biased results.
To our knowledge, the impact of data fragmentation across healthcare centers on an HTCP algorithm has not been explicitly investigated. The present study evaluated the effect of data fragmentation on an HTCP algorithm developed within the eMERGE Network for specifying patients with type 2 diabetes mellitus (T2DM).
The eMERGE T2DM algorithm
T2DM is a chronic disease influenced by multiple genes that poses an enormous public health burden.15 As provided in detail elsewhere (unpublished material, Wei W, 2011),16 the eMERGE T2DM algorithm is EMR based and was developed by researchers from Northwestern University and enhanced by other participating institutes within the eMERGE Network. The primary goal of this algorithm is to maximize the positive predictive value (PPV), or precision, of identifying ‘T2DM subjects’, a term used herein to mean patients with T2DM, and to avoid contaminating that group with individuals who have no diabetes mellitus (DM) of any type or who have type 1 DM (T1DM). With respect to unaffected subjects (herein termed ‘non-DM subjects’), the goal of the algorithm is to maximize the PPV of identifying individuals with no DM, excluding even those at risk of DM that has not yet manifested itself (ie, pre-DM).
Previous evidence has suggested that ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) codes alone would not provide enough accuracy to identify patients with DM.17 18 More importantly, T2DM subjects identified using only diagnosis codes could be contaminated with T1DM subjects because many patients are assigned the code for ‘diabetes mellitus, unspecified type’ and some patients with T2DM diagnosis codes are actually T1DM subjects who have been wrongly assigned a code for T2DM. To avoid such potential misclassification, the algorithm developers supplemented the use of diagnosis codes with relevant laboratory results and medication prescriptions (figures 1 and 2).
Previous evaluation studies indicated that the algorithm achieved 98% and 100% PPVs for identification of T2DM subjects and non-DM subjects, respectively, compared with clinician review.16 However, both the EMR data and the records that were reviewed came from the same medical center. Thus the effect of data fragmentation across healthcare centers on its performance is still unknown. We chose to evaluate the effect of data fragmentation across healthcare centers on the basis of this algorithm because it involves virtually all structured EMR data (ie, diagnosis, laboratory values, and medication) and has demonstrated high accuracy within a single medical center.
Materials and methods
This is a population-based medical records study. It was conducted in Olmsted County, Minnesota (2010 census population, 144 248).
The study took advantage of Rochester Epidemiology Project (REP) resources.19 REP is a medical records-linkage system for all residents of Olmsted County, which has been continuously funded by the National Institutes of Health since 1966. Population-based studies using REP resources are feasible because Rochester, the county seat, is geographically isolated (approximately 136 km from the nearest urban center) and home to Mayo Clinic, one of the world's largest medical centers. Thus >95% of medical care received by County residents is provided by either Mayo Clinic, with its two affiliated hospitals, or Olmsted Medical Center (OMC), a second group practice, with its affiliated hospital.20 The REP maintains a unique identifier for each Olmsted County resident over time and across healthcare centers, and each resident's clinical data from virtually all sources of medical care (hospital inpatient, hospital outpatient, emergency department, office, and nursing home visits) can be combined for approved clinical research.19 20
The study was approved by the Mayo Clinic and the OMC Institutional Review Boards. We first used the REP census19 to identify all unique individuals residing in Olmsted County in 2007. Persons who refused authorization for use of their medical record in research at either OMC or Mayo Clinic (typically, <5%21) were excluded. To be eligible for this study, subjects had to be Olmsted County residents and to have had at least one encounter at Mayo Clinic from January 1, 2006 through December 31, 2007 and also at least one encounter at OMC from January 1, 2006 through December 31, 2007.
We obtained 2 years of EMR data for eligible patients (2006 and 2007) from OMC and Mayo Clinic separately. We searched administrative claims data to determine the presence (or absence) of DM-relevant ICD-9-CM codes (see online appendix (ICD_codes)). Outpatient laboratory data for DM-relevant tests were reviewed to determine whether a subject had an abnormal value. The sources of medication data were electronic outpatient prescription databases. One author (PJC), a licensed internal medicine physician with a focus on diabetes, manually reviewed the databases and provided a list of generic drug names, brand names, synonyms, and abbreviations for DM-relevant medications (see online appendix (drug list)). We searched the medication data for the terms on the list to determine whether or not a patient had been prescribed any such medications.
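The medication search described above can be sketched in a few lines of Python. This is only an illustration: the term list below is a hypothetical excerpt (the study's actual list is in its online appendix), and the function name and the free-text prescription format are assumptions.

```python
import re

# Hypothetical excerpt of a DM-relevant drug list (generic and brand names).
# The study's actual list is in its online appendix and is not reproduced here.
DM_TERMS = ["metformin", "glucophage", "glipizide", "insulin glargine", "lantus"]

# One case-insensitive pattern with word boundaries, so a term only matches
# as a whole word or phrase rather than inside another word.
DM_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in DM_TERMS) + r")\b",
    re.IGNORECASE,
)


def has_dm_medication(prescriptions):
    """Return True if any prescription string mentions a listed term."""
    return any(DM_PATTERN.search(text) for text in prescriptions)


patient_a = ["Metformin 500 mg tablet, twice daily", "lisinopril 10 mg"]
patient_b = ["atorvastatin 20 mg", "lisinopril 10 mg"]
print(has_dm_medication(patient_a))
print(has_dm_medication(patient_b))
```

In practice a list curated by a clinician, including synonyms and abbreviations as in the study, matters more than the matching code itself.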
We first executed the eMERGE T2DM algorithm (figures 1 and 2) on EMR data combined from both OMC and Mayo Clinic EMR systems. The categorization of eligible patients as ‘T2DM subjects’ and ‘non-DM subjects’ using data from both EMR systems was considered the gold standard in this study. We then executed the algorithm using Mayo Clinic EMR data alone; for T2DM subjects and non-DM subjects separately, we calculated the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs) against the gold standard. We also estimated PPVs and false-negative rates (FNRs) to evaluate misclassification errors caused by data fragmentation across healthcare centers.
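To illustrate why the combined and single-center runs can disagree, the sketch below merges one patient's data from two centers and applies a deliberately simplified rule. The rule is a stand-in and is not the eMERGE T2DM algorithm (which is given in figures 1 and 2); the record schema, field names, and example values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CenterRecord:
    """Structured data one center holds for one patient (hypothetical schema)."""
    encounters: int = 0
    dm_codes: set = field(default_factory=set)
    dm_medications: set = field(default_factory=set)
    abnormal_labs: int = 0


def merge(a, b):
    """Combine the data two centers hold for the same patient."""
    return CenterRecord(
        encounters=a.encounters + b.encounters,
        dm_codes=a.dm_codes | b.dm_codes,
        dm_medications=a.dm_medications | b.dm_medications,
        abnormal_labs=a.abnormal_labs + b.abnormal_labs,
    )


def toy_rule(rec):
    """Simplified stand-in rule, NOT the eMERGE criteria: at least two
    encounters, a DM diagnosis code, and a DM medication or abnormal lab."""
    has_support = bool(rec.dm_medications) or rec.abnormal_labs > 0
    return rec.encounters >= 2 and bool(rec.dm_codes) and has_support


# Illustrative patient: the code is recorded at one center, the drug at the other.
mayo = CenterRecord(encounters=1, dm_codes={"250.00"})
omc = CenterRecord(encounters=1, dm_medications={"metformin"})

print(toy_rule(mayo))              # single-center view
print(toy_rule(merge(mayo, omc)))  # combined view
```

With these example values the single-center view fails the rule on both the encounter count and the missing medication, while the combined view passes, which is the kind of false negative the study measures.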
The numerator for PPV is the number of TP subjects, that is, those assigned to a category (T2DM or non-DM) using EMR data from both centers who were also assigned to that category using EMR data from Mayo Clinic alone. The denominator for PPV is the number of TP subjects plus the number of FP subjects, with FP subjects defined as those not assigned to the category by the gold standard but assigned to it using EMR data from Mayo Clinic alone.
The numerator for FNR is the number of FN subjects, that is, those assigned to the category by the gold standard but not assigned to it using EMR data from Mayo Clinic alone. The denominator for FNR is the number of FN subjects plus the number of TP subjects.
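The PPV and FNR definitions above translate directly into set operations on subject identifiers. The following is a minimal sketch; the function name and the synthetic identifiers are assumptions, chosen so that the counts match those reported for T2DM subjects (513 TP, 27 FP, 252 FN).

```python
def compare_to_gold(gold_ids, single_center_ids):
    """Count TP, FP, and FN for one category and derive PPV and FNR."""
    gold = set(gold_ids)
    single = set(single_center_ids)
    tp = len(gold & single)   # in the category under both runs
    fp = len(single - gold)   # only in the single-center run
    fn = len(gold - single)   # only in the gold-standard run
    return {
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "PPV": tp / (tp + fp),
        "FNR": fn / (fn + tp),
    }


# Synthetic identifiers arranged to reproduce the reported T2DM counts.
gold_t2dm = range(765)                                         # 765 gold-standard subjects
single_t2dm = list(range(252, 765)) + list(range(1000, 1027))  # 513 shared + 27 extra

result = compare_to_gold(gold_t2dm, single_t2dm)
print(result["TP"], result["FP"], result["FN"])
print(f"PPV = {result['PPV']:.1%}, FNR = {result['FNR']:.1%}")
```

The same function applies unchanged to the non-DM category by passing the two sets of non-DM subject identifiers.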
The McNemar test22 was used to analyze whether the categorization that resulted from the use of EMR data from two centers differed from the categorization when Mayo Clinic EMR data alone were used. The distributions of two commonly used epidemiological characteristics (age and sex) were compared between TP T2DM subjects and FN T2DM subjects to estimate whether T2DM subjects falsely excluded because of data fragmentation were statistically different from identified T2DM subjects. Comparison of the mean age between two groups was performed with the t test. Comparison of sex proportions was performed with the χ2 test. Statistical significance was accepted when p was <0.05. All data are presented as mean and SD. Statistical analysis was performed with R for Windows software V.22.214.171.124
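As a rough illustration of the McNemar comparison, the test statistic depends only on the two discordant counts. The sketch below assumes that, for the T2DM category, the discordant counts are the 27 FP and 252 FN subjects reported in the results; the source does not state how the paired table was constructed or whether a continuity correction was applied, so both are assumptions here.

```python
import math


def mcnemar(b, c, continuity=True):
    """McNemar chi-square statistic (1 df) and p value from discordant counts."""
    diff = abs(b - c)
    if continuity:
        diff = max(diff - 1, 0)  # continuity correction (an assumption here)
    stat = diff ** 2 / (b + c)
    # For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Assumed discordant counts for the T2DM category: 27 FP and 252 FN.
stat, p_value = mcnemar(b=27, c=252)
print(f"chi-square = {stat:.2f}, p < 0.001: {p_value < 0.001}")
```

Under these assumptions the p value falls far below 0.001, in line with the reported result.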
Results
Of 139 654 Olmsted County residents in 2007, 12 740 (9.1%) had at least one encounter at the Mayo Clinic from January 1, 2006 through December 31, 2007 and at least one encounter at OMC within the same time frame (table 1). These 12 740 residents were eligible for the present study.
T2DM subject identification
Of the 12 740 eligible subjects, 6.0% (765) met eMERGE T2DM algorithm inclusion criteria for T2DM subjects when their combined Mayo Clinic and OMC EMR data were used (table 2). These 765 subjects were considered true T2DM subjects for this study.
By comparison, 540 patients were identified as T2DM subjects when their Mayo Clinic EMR data alone were used; 513 were TP and 27 were FP (table 2). The PPV was 95% (513/540). The other 252 true T2DM subjects were FN (ie, incorrectly excluded when Mayo Clinic EMR data alone were used). The FNR was 32.9% (252/765). We found differences in the mean age (p=0.012) and sex proportion (p=0.004) between the group of 513 correctly identified T2DM subjects (62.1 (15.2) years; female to male ratio, 230:283) and the group of 252 missed T2DM subjects (65.0 (14.7) years; female to male ratio, 141:111). The McNemar test also indicated a difference between the categorization with EMR data from both centers and the categorization with data from Mayo Clinic alone (p<0.001).
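The age and sex comparisons can be approximately reproduced from the summary statistics given above. The sketch below uses a Pearson χ2 test without continuity correction and a pooled-variance t test with a normal approximation to the t distribution; the source does not say which variants were used, so these choices are assumptions.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance two-sample t statistic with a two-sided p value taken
    from the normal approximation (reasonable when df is in the hundreds)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    t = (mean2 - mean1) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df, math.erfc(abs(t) / math.sqrt(2))


# Sex: 230 female and 283 male TP subjects; 141 female and 111 male FN subjects.
chi2_stat, chi2_p = chi2_2x2(230, 283, 141, 111)

# Age: TP 62.1 (15.2) years, n=513; FN 65.0 (14.7) years, n=252.
t_stat, df, t_p = pooled_t(62.1, 15.2, 513, 65.0, 14.7, 252)

print(f"sex: chi-square = {chi2_stat:.2f}, p = {chi2_p:.3f}")
print(f"age: t = {t_stat:.2f} (df = {df}), p = {t_p:.3f}")
```

Under these assumptions both p values round to the reported figures (p=0.004 for sex and p=0.012 for age), although small differences from the published analysis are possible because it was run on individual-level data.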
With respect to which eMERGE inclusion/exclusion criteria (see figure 1) accounted for the misclassification of T2DM subjects with Mayo Clinic EMR data alone, all 27 FP T2DM subjects and 111 of the 252 (44%) FN T2DM subjects resulted from incomplete diagnosis codes at Mayo Clinic (table 3). Incomplete medication data at Mayo Clinic led to 75 (30%) FN T2DM subjects; an additional 53 (21%) FN T2DM subjects resulted from having only one encounter at Mayo Clinic 2006–2007 (the algorithm required at least two encounters). The remaining 13 (5%) FN T2DM subjects had abnormal laboratory results missing at Mayo Clinic.
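As a quick arithmetic check, the four causes of FN T2DM subjects listed above should sum to 252 and yield the stated percentages. The dictionary labels below are paraphrases chosen for this sketch.

```python
# Counts of FN T2DM subjects by cause, as reported in the text above.
fn_causes = {
    "incomplete diagnosis codes": 111,
    "incomplete medication data": 75,
    "only one Mayo Clinic encounter": 53,
    "missing abnormal laboratory results": 13,
}

total = sum(fn_causes.values())
shares = {cause: round(100 * n / total) for cause, n in fn_causes.items()}

print(total)
for cause, pct in shares.items():
    print(f"{cause}: {fn_causes[cause]} ({pct}%)")
```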
Non-DM subject identification
With EMR data from both OMC and Mayo Clinic, 4256 subjects were identified by the algorithm as non-DM subjects (table 2). These were considered gold-standard non-DM subjects for this study.
With EMR data from Mayo Clinic alone, 2898 eligible subjects were categorized as non-DM subjects. However, only 2683 were TP, 215 were FP, and 1573 were FN (ie, incorrectly excluded as non-DM subjects when EMR data from a single medical center were used). The PPV and FNR were 92.6% (2683/2898) and 37.0% (1573/4256), respectively (table 2). Statistical analysis indicated a difference between the categorization with data from the two healthcare centers and that with data from Mayo Clinic alone (p <0.001).
With respect to which eMERGE inclusion/exclusion criteria (see figure 2) accounted for the misclassification of non-DM subjects with Mayo Clinic EMR data alone, incomplete laboratory data contributed to 135 (63%) FP non-DM subjects and 1074 (68%) FN non-DM subjects (table 3). Incomplete diagnosis codes contributed to another 73 (34%) FP non-DM subjects, and 499 (32%) of FN non-DM subjects resulted from having fewer than two encounters at Mayo Clinic 2006–2007.
Discussion
Current clinical research is limited by a labor-intensive subject selection process, which has become a formidable obstacle to conducting broad and deep studies and drawing powerful conclusions. An HTCP algorithm leverages machine-processable EMR data to reduce this inefficiency. Often, however, a patient is seen at multiple medical centers, so a single medical center does not have the patient's complete medical data when executing an algorithm. To our knowledge, how this data fragmentation across healthcare providers affects the accuracy of an HTCP algorithm has not been previously investigated. Such an investigation is difficult to conduct because it requires access to multiple EMRs from heterogeneous sources at multiple medical centers. By taking advantage of the REP, we were able to carry out such an investigation.
When using the combined Mayo Clinic and OMC EMR data for the 12 740 eligible subjects, 6.0% (765) met eMERGE T2DM algorithm inclusion criteria for T2DM subjects (table 2). This percentage is slightly lower than the prevalence of DM for all age groups in the USA (8.3%)24 because not all Olmsted County residents were tested for DM in the 2 years of the study.
Our results, combined with findings from other studies,8 14 show the advantage of access to more complete data for clinical research. In the present study, data fragmentation across healthcare centers resulted in incomplete data for any one EMR when the eMERGE T2DM algorithm was executed in Olmsted County, and that incompleteness substantially decreased the algorithm's accuracy.
For T2DM subject identification, we found categorization differences between data from both centers and data from Mayo Clinic alone. The differences were mainly the result of a large proportion of FN T2DM subjects (n=252; FNR, 32.9%). The 252 FN T2DM subjects differed with respect to age and sex distribution from the 513 TP T2DM subjects. This difference suggests that, for age/sex-matched designs, matching could be skewed when HTCP algorithms are applied to EMR data from a single medical center. Even though the eMERGE T2DM algorithm is reported to achieve a 98% PPV for identification of T2DM subjects compared with clinician review,16 we still identified 27 (5.0%) FP T2DM subjects because of data fragmentation across healthcare centers.
For non-DM subject identification, we also found categorization differences between data from both centers and data from Mayo Clinic alone. The differences were mainly the result of a large proportion of FN non-DM subjects (n=1573; FNR, 37.0%). Even though the eMERGE T2DM algorithm is reported to achieve a 100% PPV for identification of non-DM subjects compared with clinician review,16 we still identified 215 (7.4%) FP non-DM subjects because of data fragmentation across healthcare centers.
Incomplete diagnosis codes were the main reason for FP errors and accounted for all FP T2DM subjects. Absent laboratory results and incomplete diagnosis codes led to the majority of FP non-DM subjects. FNs were caused by incomplete diagnosis codes, laboratory values, or medication data. We also found that 53 FN T2DM subjects (21%) and 499 FN non-DM subjects (32%) were missed because they had fewer than two clinical visits at Mayo Clinic during the study period. As the time frame we used was 2 years, which is much longer than the recommended interval between T2DM visits (3–6 months),24 25 these insufficient clinical visits must have resulted from data fragmentation across centers as well.
The misclassification errors caused by data fragmentation could lead to sampling bias and risk serious distortions in the findings of resulting studies.26 These outcomes should be carefully considered by clinical researchers when developing or executing an algorithm. The ultimate solution to the data fragmentation problem is integrating EMR systems across healthcare centers. However, achieving such an ambitious goal requires overcoming serious technological challenges and addressing complex ethical issues. Some Beacon projects funded by the Office of the National Coordinator (ONC) are prototyping approaches to this problem.27
Clinical narratives (unstructured clinical data) contain detailed descriptions of a patient's diseases and may include information from other healthcare centers. This additional information can be extracted by using natural language processing techniques and turned into normalized data for further analysis using other advanced techniques, for example, data mining.28 Discovered patterns could then be reviewed and adopted in subject selection criteria. This approach may work, with the caveat that the additional data must be relevant to the condition under study. Our previous work, along with other studies, has shown its potential for subject selection tasks.6 29–32
Several issues about this study design should be considered when interpreting the findings. Because of unavoidable random or systematic errors (eg, physician experience, communication quality between the patient and the clinician, and coding quality), it is extremely difficult to obtain a patient's actual condition or the true gold standard.33 The manual effort required to validate the distinction between T1DM and T2DM obtained using the algorithm against medical review requires information at the time of DM onset34 and was beyond the scope of the present study. In this study, our gold standard was based on classifications using 2 years of EMR data from two major healthcare centers in Olmsted County. Because most Olmsted County residents receive their healthcare at these two healthcare centers and the observation window we chose is much broader than the recommended frequency of T2DM visits, this is a pragmatic gold standard for this study.
Our results may not generalize to large metropolitan areas. Our study setting is a sparsely populated, relatively isolated county in southeastern Minnesota. The residents of Olmsted County have fewer options for healthcare centers than people living in a large metropolitan area. Thus the misclassification errors that we found by comparing the selected categorizations are most likely smaller than would be seen in a typical setting. Also, this study focuses on how HTCP is affected by incomplete data due to data fragmentation across healthcare centers alone. It does not investigate the impact of incomplete data due to other factors, for example, insufficient longitudinal data, which is a topic for another study (unpublished material, Wei W, 2011). In addition, the scope of our study is limited to the eMERGE T2DM algorithm alone. For a more complete evaluation of the impact of data fragmentation on HTCP algorithms, this study needs to be repeated at different geographic locations, over various periods of observation, and on a wide spectrum of HTCP algorithms.
Conclusion
This study, to our knowledge, is the first attempt to assess the impact of data fragmentation on an HTCP algorithm across multi-institution EMRs. Our results show that data fragmentation across healthcare centers causes misclassification errors of an HTCP algorithm. This risk should be carefully considered by clinical researchers when developing or executing an HTCP algorithm.
Funding This study was supported by the Biomedical Informatics and Computational Biology Traineeship Program, the University of Minnesota, and the eMERGE project, NIH U01 HG04599.
Competing interests None.
Patient consent Obtained.
Ethics approval Mayo Clinic and OMC IRBs.
Provenance and peer review Not commissioned; externally peer reviewed.