Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study
- Abel N Kho1,
- M Geoffrey Hayes1,
- Laura Rasmussen-Torvik1,
- Jennifer A Pacheco1,
- William K Thompson1,
- Loren L Armstrong1,
- Joshua C Denny2,
- Peggy L Peissig3,
- Aaron W Miller3,
- Wei-Qi Wei4,
- Suzette J Bielinski4,
- Christopher G Chute4,
- Cynthia L Leibson4,
- Gail P Jarvik5,
- David R Crosslin5,
- Christopher S Carlson6,
- Katherine M Newton7,
- Wendy A Wolf8,
- Rex L Chisholm1,
- William L Lowe1
- 1Northwestern University, Chicago, Illinois, USA
- 2Vanderbilt University, Nashville, Tennessee, USA
- 3Marshfield Clinic, Marshfield, Wisconsin, USA
- 4Mayo Clinic, Rochester, Minnesota, USA
- 5University of Washington Seattle, Seattle, Washington, USA
- 6Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- 7Group Health Cooperative, Seattle, Washington, USA
- 8Division of Genetics, Children's Hospital Boston, Boston, Massachusetts, USA
- Correspondence to Abel N Kho, Northwestern University, Division of General Internal Medicine, 750 N. Lake Shore Drive, 10th Floor, Chicago, IL 60611, USA;
Contributors ANK developed the algorithm, collected and analyzed data, performed chart reviews, and wrote the manuscript. MGH conducted genetic analyses and wrote the manuscript. LRT conducted genetic analyses, and reviewed and edited the manuscript. JAP collected and analyzed data, and wrote the manuscript. WKT created the standardized data and workflow within KNIME. LLA conducted genetic analyses. JCD collected data, performed chart reviews, and reviewed and edited the manuscript. PLP collected and analyzed data, and reviewed and edited the manuscript. AWM collected and analyzed data. WQW collected and analyzed data, and reviewed and edited the manuscript. SJB collected and analyzed data. CGC reviewed and edited the manuscript. CLL collected and analyzed data. GPJ collected and analyzed data. DRC performed genetic analyses. CSC collected and analyzed data. KMN reviewed and edited the manuscript. WAW reviewed and edited the manuscript. RLC reviewed and edited the manuscript. WLL developed the algorithm and wrote the manuscript.
- Received 22 June 2011
- Accepted 27 October 2011
- Published Online First 19 November 2011
Objective Genome-wide association studies (GWAS) require high specificity and large numbers of subjects to identify genotype–phenotype correlations accurately. The aim of this study was to identify type 2 diabetes (T2D) cases and controls for a GWAS, using data captured through routine clinical care across five institutions using different electronic medical record (EMR) systems.
Materials and Methods An algorithm was developed to identify T2D cases and controls based on a combination of diagnoses, medications, and laboratory results. The performance of the algorithm was validated at three of the five participating institutions compared against clinician review. A GWAS was subsequently performed using cases and controls identified by the algorithm, with samples pooled across all five institutions.
Results The algorithm achieved 98% and 100% positive predictive values for the identification of diabetic cases and controls, respectively, as compared against clinician review. By standardizing and applying the algorithm across institutions, 3353 cases and 3352 controls were identified. Subsequent GWAS using data from five institutions replicated the TCF7L2 gene variant (rs7903146) previously associated with T2D.
Discussion By applying stringent criteria to EMR data collected through routine clinical care, cases and controls for a GWAS were identified that subsequently replicated a known genetic variant. The use of standard terminologies to define data elements enabled pooling of subjects and data across five different institutions to achieve the robust numbers required for GWAS.
Conclusions An algorithm using commonly available data from five different EMR can accurately identify T2D cases and controls for genetic study across multiple institutions.
- application of biological knowledge to clinical care
- biomedical informatics
- clinical phenotyping
- controlled terminologies and vocabularies
- data mining
- EMR secondary and meaningful use
- genetic epidemiology
- genome-wide association studies
- HIT data standards
- improving the education and skills training of health professionals
- infection control
- information retrieval
- knowledge representations
- linking the genotype and phenotype
- medical informatics
- natural-language processing
- translational research
Type 2 diabetes (T2D) is an increasing public health problem.1 2 Although environmental factors including diet and physical activity contribute to the etiology of T2D, a genetic contribution to T2D has been well documented.3–6 Genome-wide association studies (GWAS) have been most effective in identifying T2D susceptibility genes, although susceptibility alleles identified through these approaches have typically conferred small increases in risk requiring large numbers of subjects to identify susceptibility genes.7
To date, the traditional approach to case–control studies has been to recruit and phenotype study subjects prospectively. An alternative and more efficient approach would be to identify large numbers of cases and controls among patients receiving routine medical care. Until recently, it was unclear whether data collected through routine clinical care could achieve similar data quality compared with prospective study collection. The Electronic Medical Records and Genomics (eMERGE) consortium formed in 2007 to investigate if electronic medical record (EMR) linked DNA biorepositories can be leveraged for high-throughput genomic research.8 One of eMERGE's main goals is to assess whether EMR used in routine clinical care provide suitable data to identify individuals with specific phenotypes for GWAS. To date, the eMERGE consortium has successfully developed and validated several algorithms to identify accurately individuals with specific phenotypes.9–11 With recent initiatives promoting the widespread adoption of EMR, algorithms leveraging EMR-derived data for the identification of phenotypes may take on increasing importance.12 13
Candidate gene studies and GWAS have identified a large number of genetic variants that contribute to T2D susceptibility, providing known targets to test our hypothesis.3–7 14 15 Previous studies have documented that data captured through routine clinical care as part of the EMR can successfully identify patients with diabetes,16 17 but T2D is a particularly challenging phenotype. First, it must be distinguished from type 1 diabetes (T1D), which has a similar phenotype of hyperglycemia and shares at least one treatment, insulin, with T2D. However, the genetic underpinnings of T1D and T2D differ.18 19 Second, although T2D is increasingly common in youth and young adults, onset is typically later in life, complicating the identification of a control group at low risk of T2D.
This study aimed to develop an algorithm, using commonly collected data across multiple EMR systems, to identify individuals with T2D and test the hypothesis that EMR-derived phenotypes can be used as an alternative approach to the prospective collection of disease cohorts to identify genetic variants associated with T2D. Described here is an algorithm that successfully identified a multi-ethnic cohort of T2D cases and controls across five institutions and permitted replication of the TCF7L2 variant rs7903146, the polymorphism with the strongest and most consistently replicated genome-wide association with T2D.
Research design and methods
eMERGE site overview
Five institutions participated in this study. All institutions obtained appropriate approval from their respective institutional review boards, and made use of a common data use agreement to enable data sharing between institutions. Each institution used an EMR for documentation of routine clinical care linked to a research specimen biorepository. Table 1 lists key features of each institution's EMR and biorepository. Additional details have been published previously.8 20 Notably, each site obtained appropriate patient consent for all study participants, with one site, Vanderbilt University (VU) making use of an opt-out consent model.21 Study participants represent the subset of patients who receive routine clinical care at study institutions, and also consented to participation within the institutional biorepository. Each eMERGE center selected a primary phenotype for investigation; T2D was led by Northwestern University (NU). The algorithm described below was used to identify all possible T2D case and control individuals at NU, and supplemented with all possible African ancestry cases and controls at VU. These samples were then supplemented with T2D cases and controls identified using the same algorithm from other eMERGE sites where individuals were selected for genotyping using independently derived algorithms for phenotypes of interest at that particular institution (eg, QRS duration at VU, cataracts at Marshfield, vascular disease at Mayo Clinic and dementia at Group Health Cooperative).
We used the existing clinical diagnostic criteria developed by the American Diabetes Association to develop an approach to identify T2D using commonly captured EMR data, including diagnostic codes, medications, and laboratory test results. T2D is typically diagnosed clinically by documenting hyperglycemia, a fasting glucose greater than or equal to 126 mg/dl or random glucose greater than or equal to 200 mg/dl. Our study focused on a non-pregnant adult population and did not utilize the results of oral glucose tolerance tests, which are most commonly used to screen for gestational diabetes. Each site used as many years of EMR data as were available for their study population.
To ensure comparability across sites, we identified the appropriate national standards to define diagnoses, medications, and laboratory tests (see supplementary appendices 1–3, available online only). All sites utilized International Classification of Diseases, 9th revision, clinical modification (ICD-9-CM) diagnostic codes. We defined medications using unique RxNorm codes at an ingredient level and defined laboratory tests using the Logical Observation Identifiers Names and Codes (LOINC) standard.22 We identified the ‘best fit’ LOINC codes by units of measurement and overall frequency of clinical use. We included all patients with ICD-9-CM codes of 250.x0 or 250.x2, except for codes 250.10 and 250.12 (indicative of T2D with ketoacidosis, a condition also closely associated with T1D), patients on T2D medications and/or insulin at any time, and all patients with abnormal glucose (>200 mg/dl) or hemoglobin A1c (HbA1c; ≥6.5%) laboratory test results.
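As a minimal sketch of the diagnosis-code filter described above, the 250.x0/250.x2 pattern and the two ketoacidosis exclusions can be written as a regular expression plus an exclusion set. The function name `is_t2d_candidate_code` is hypothetical and is not taken from the study's published code; the sketch also assumes that codes are stored as five-digit strings that include the decimal point.

```python
import re

# 250.x0 or 250.x2: the fifth digit (0 or 2) marks type 2 or unspecified-type diabetes.
T2D_CODE_PATTERN = re.compile(r"^250\.\d[02]$")

# Ketoacidosis codes that the text excludes because they overlap with T1D.
EXCLUDED_CODES = {"250.10", "250.12"}


def is_t2d_candidate_code(code: str) -> bool:
    """Return True if an ICD-9-CM code matches 250.x0/250.x2 and is not excluded."""
    code = code.strip()
    return bool(T2D_CODE_PATTERN.match(code)) and code not in EXCLUDED_CODES


if __name__ == "__main__":
    for example in ["250.00", "250.02", "250.10", "250.01", "401.9"]:
        print(example, is_t2d_candidate_code(example))
```

A site that stores codes without the decimal point, or that applies the broader 250.xx rule mentioned for one institution, would need to adjust the pattern accordingly.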
The algorithm was originally developed at an institution with primarily structured data housed within a clinical data warehouse, easily accessed through Structured Query Language (SQL) queries. Each site adapted the algorithm to suit the clinical data stored within their institutional EMR, and to take advantage of local data extraction and analytical tools. All sites utilized a data warehouse, separate from their transactional EMR system to execute the algorithm and avoid impacting EMR performance. Subsequent sites required varying degrees of natural language processing to extract structured data from otherwise unstructured free text clinical notes. One site utilized all diabetes codes due to a noted consistent billing pattern of using ICD-9-CM code 250.00 for both T1D and T2D patients.
In order to increase algorithm specificity and reduce the risk that an outlying observation might inadvertently capture a miscoded diagnosis, we required some redundancy in diagnostic criteria. We required cases with a T2D diagnosis code to have either an abnormal laboratory test or a prescription for a T2D medication. We required cases without a T2D diagnosis code in the EMR to have documentation of both a prescription for a T2D medication and an abnormal glucose (random glucose >200 mg/dl, fasting glucose >125 mg/dl) or a HbA1c laboratory test result of 6.5% or greater (figure 1).
Patients with a diabetes diagnosis (ICD-9-CM 250.xx) and only documented as on insulin predictably proved the most difficult to categorize as T1D or T2D patients. To differentiate between T1D and T2D in subjects who were treated with insulin, we required that patients treated with insulin alone either have a past prescription for a T2D medication or meet the following criteria: no T1D diagnoses and greater than or equal to two T2D diagnoses entered by a clinician (ie, not billing coders) on different dates.
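The case rules described in the two preceding paragraphs can be sketched as a single decision function. This is an illustration under stated assumptions, not the published algorithm: the `Patient` fields and the function name are hypothetical, the `abnormal_lab` flag is assumed to have been computed upstream from the glucose and HbA1c thresholds, and the way the insulin-only rule takes precedence over the general rule is our reading of the text rather than a reproduction of figure 1.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Patient:
    # All field names are hypothetical; they summarize one patient's EMR data.
    has_t2d_code: bool = False        # any qualifying T2D diagnosis code
    has_t1d_code: bool = False        # any T1D diagnosis code
    on_t2d_medication: bool = False   # ever prescribed a non-insulin T2D medication
    on_insulin: bool = False          # ever prescribed insulin
    abnormal_lab: bool = False        # abnormal glucose or HbA1c >= 6.5%
    clinician_t2d_dx_dates: List[date] = field(default_factory=list)


def is_t2d_case(p: Patient) -> bool:
    """Apply the redundancy rules for case status described in the text."""
    insulin_only = p.on_insulin and not p.on_t2d_medication
    if p.has_t2d_code and insulin_only:
        # Insulin alone: require no T1D diagnosis and at least two
        # clinician-entered T2D diagnoses on different dates.
        distinct_dates = len(set(p.clinician_t2d_dx_dates))
        return (not p.has_t1d_code) and distinct_dates >= 2
    if p.has_t2d_code:
        # Diagnosis code plus one supporting item (laboratory test or medication).
        return p.abnormal_lab or p.on_t2d_medication
    # No diagnosis code: require both a medication and an abnormal laboratory test.
    return p.on_t2d_medication and p.abnormal_lab
```

Because `on_t2d_medication` records any past prescription, a patient now treated with insulin who previously received a T2D medication falls through to the general rule and qualifies, which matches the first alternative given for insulin-treated patients.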
We similarly developed an algorithm to identify control subjects without diabetes (figure 2). We excluded patients with any diabetes diagnosis (ICD-9-CM codes 250.xx) or any of the diagnoses listed in supplementary table 1 (available online only), patients on insulin or any of the medications listed in supplementary table 2 (available online only), patients who used any diabetic supplies (ie, insulin syringes, glucose monitors), and patients with any abnormal glucose (≥110 mg/dl) or HbA1c (≥6.0%) laboratory values. We also excluded patients with a family history of diabetes, either in the EMR or in questionnaire data if available. We also required controls to have had at least one normal glucose measurement and at least two in-person clinician encounters to ensure that patients had sufficient data in their EMR to determine confidently that they did not have diabetes. Figure 2 depicts the final algorithm for choosing control subjects.
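The control criteria above amount to a set of exclusions followed by two minimum-data requirements. The sketch below illustrates that structure; the `Candidate` fields and function name are hypothetical, and membership of the supplementary diagnosis and medication lists is assumed to have been resolved upstream into simple flags.

```python
from dataclasses import dataclass, field
from typing import List

# Control thresholds as stated in the text.
GLUCOSE_CUTOFF_MG_DL = 110.0
HBA1C_CUTOFF_PCT = 6.0


@dataclass
class Candidate:
    # All field names are hypothetical summaries of EMR content.
    has_diabetes_dx: bool = False               # 250.xx or a listed related diagnosis
    on_diabetes_med_or_supplies: bool = False   # insulin, listed medications, or supplies
    family_history_of_diabetes: bool = False
    glucose_mg_dl: List[float] = field(default_factory=list)
    hba1c_pct: List[float] = field(default_factory=list)
    in_person_encounters: int = 0


def is_control(c: Candidate) -> bool:
    """Return True only if no exclusion applies and the minimum data are present."""
    if c.has_diabetes_dx or c.on_diabetes_med_or_supplies or c.family_history_of_diabetes:
        return False
    if any(g >= GLUCOSE_CUTOFF_MG_DL for g in c.glucose_mg_dl):
        return False
    if any(a >= HBA1C_CUTOFF_PCT for a in c.hba1c_pct):
        return False
    # Every remaining glucose value is normal, so one value is enough evidence.
    has_normal_glucose = len(c.glucose_mg_dl) >= 1
    return has_normal_glucose and c.in_person_encounters >= 2
```

Note that a candidate with no glucose measurement at all is rejected: absence of an abnormal value is not treated as evidence of normal glycemia.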
As an interim step to validate the results of our algorithms before genotyping, three sites conducted a blinded chart review of at least 100 total cases and controls, with one site conducting an additional 50 chart reviews for a related study. Two sites (NU, VU) utilized clinician reviewers, and one site (Marshfield Clinic Research Foundation, MCRF) used trained chart reviewers. We reviewed charts of patients identified as either cases or controls in order to assess completely the positive predictive value (PPV) of our algorithm to identify cases and controls accurately for subsequent GWAS. Using manual chart review as a comparison standard, we generated a PPV for both cases and controls, for both iterations of the automated algorithms. Statistical analyses were performed using R,23 specifically the epiR24 package.
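For readers who want to reproduce the validation arithmetic, the sketch below computes a PPV from chart-review counts together with a Wilson score interval. The study used the epiR package in R; the Wilson interval here is a swapped-in, commonly used method for a binomial proportion and is not claimed to match the epiR output. The counts in the usage example are illustrative and are not study data.

```python
import math
from typing import Tuple


def ppv(true_positives: int, false_positives: int) -> float:
    """PPV = algorithm-positive patients confirmed by review / all algorithm-positive patients."""
    return true_positives / (true_positives + false_positives)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Illustrative counts only: 49 of 50 algorithm-identified cases confirmed on review.
    confirmed, reviewed = 49, 50
    low, high = wilson_interval(confirmed, reviewed)
    print(f"PPV = {ppv(confirmed, reviewed - confirmed):.3f}, interval = ({low:.3f}, {high:.3f})")
```

With review samples of about 50 charts per group, even a PPV of 100% leaves a lower interval bound well below 1, which is worth keeping in mind when comparing sites.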
Genotyping was performed at the Broad Institute and Center for Inherited Disease Research on the Illumina 660W and 1M Bead Chips (Illumina Inc, San Diego, CA, USA). Genotype cleaning and quality control were performed collaboratively by all five sites using a previously described approach.25 Genotype data for rs7903146 in TCF7L2 on chromosome 10 from individuals who passed quality control and were identified as T2D cases or controls were used for this analysis to test the validity of the algorithm. Demographic differences between cohorts were assessed by analysis of variance in R.23
Associations between genotype and T2D case–control status were assessed through logistic regression assuming an additive model, and adjusting for site, age, sex, body mass index (BMI), and ancestry using PLINK.26 We used age and median BMI at the time of earliest diabetes diagnosis or diabetes medication prescription for cases, and age and median BMI at the time of enrollment in the biobank (age when biospecimen was collected) for controls. We excluded BMI measures collected during pregnancy for both cases and controls. The OR, SE of the OR, and p value from each of the five European ancestry cohorts were combined in a meta-analysis weighting each stratum by the number of samples using default settings in PLINK. This was repeated for the two African ancestry cohorts, and for all seven of the cohorts in total. We report p values and OR for a fixed-effect model.
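To make the pooling step concrete, the sketch below combines per-cohort odds ratios in a fixed-effect meta-analysis. It is an illustration under assumptions rather than a reproduction of the PLINK run: it uses inverse-variance weights, a common fixed-effect scheme that may differ from the weighting used in the study, and it assumes each standard error is expressed on the log-odds scale. The function name and the input values in the usage example are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(odds_ratios: Sequence[float],
                      log_or_ses: Sequence[float]) -> Tuple[float, float, float]:
    """Pool cohort odds ratios; returns (pooled OR, SE of pooled log OR, two-sided p)."""
    betas = [math.log(o) for o in odds_ratios]      # work on the log-odds scale
    weights = [1.0 / se ** 2 for se in log_or_ses]  # inverse-variance weights (assumption)
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled_beta / pooled_se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p value
    return math.exp(pooled_beta), pooled_se, p_value


if __name__ == "__main__":
    # Hypothetical inputs for three cohorts, not values from the study tables.
    pooled_or, pooled_se, p = fixed_effect_meta([1.4, 1.5, 1.3], [0.10, 0.15, 0.20])
    print(f"pooled OR = {pooled_or:.2f}, SE(log OR) = {pooled_se:.3f}, p = {p:.2e}")
```

The sketch shows why pooling helps: the pooled standard error is smaller than that of any single cohort, so cohorts that individually fall short of significance can contribute to a highly significant combined estimate.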
In combination, the five sites identified a total of 3353 cases and 3352 controls, of which 3266 cases and 3286 controls passed genomic quality control testing and were included in the genetic analysis. Table 2 lists the demographics from each site for samples that passed genomic quality control testing. Each cohort was significantly different (p<0.001) from each other with respect to age, sex, and BMI, although we adjusted for these differences in our subsequent genetic analysis. Table 3 summarizes validation results at three participating sites that ranged from 98.2% to 100% PPV for case identification, and were 98–100% PPV for the identification of controls. The association results are presented in table 4. Three of the five sites (Group Health, Marshfield, and NU) produced moderate to strong associations between rs7903146 and T2D in European ancestry cohorts (p value range 0.002 to 9.27×10−5). At two sites, Mayo Clinic and VU, associations trended in the same direction and approached nominal significance (p=0.1177 and 0.0601, respectively). Allele frequencies for the VU European-American (EA) cohort were similar to those in the remaining EA cohorts, but the association failed to reach significance because of a much smaller sample size. Among the two African-American (AA) cohorts, NU and VU yielded similar case and control frequencies of the associated (T) allele at rs7903146, but the small NU AA cohort (N=294) yielded a p value of 0.0867, while the larger VU AA cohort achieved high significance (p=2.25×10−6), with the difference attributable to sample size. Our cross-site cohort meta-analyses produced very similar results with even smaller p values that are all highly significant: p=2.98×10−10 for the EA, p=5.30×10−7 for AA, and p=2.05×10−15 for all subjects across all sites. The cross-cohort OR for rs7903146 was 1.46, which is similar to what has been found previously in other populations.27
In this study, we developed and validated an algorithm to identify cases with T2D and controls using standardized data elements captured through routine clinical care across five different EMR systems. Despite variations in data capture and completeness across the different systems, by applying stringent minimum criteria, and clear definition of data elements through an iterative process, we developed a final algorithm with a 98% PPV for cases and a 100% PPV for controls. We subsequently used identified samples pooled across sites to perform a GWAS.
The association tests between rs7903146 and T2D in the five EMR-derived cohorts yielded similar results to those from purposefully collected T2D case and control cohorts. In a recent meta-analysis of 29 195 T2D control subjects and 17 202 T2D case subjects from 27 populations spanning the globe Cauchi et al27 found the OR for developing T2D was 1.46 per copy of the rs7903146 T allele. We generated the exact same OR point estimate (1.46) for our EMR-derived samples using meta-analytical techniques in the pooled samples. Perhaps more importantly, our work demonstrates the power that can be achieved by combining samples across sites, evidenced by the highly significant p values from the cross-cohort analyses.
Our work expands on earlier studies to identify patients with diabetes from EMR. Previously, Wilke et al17 at Marshfield Clinic developed an effective algorithm to identify diabetes mellitus patients, but did not specifically differentiate between T1D and T2D. Other studies have utilized laboratory values, or diagnoses, laboratory tests and natural language processing to achieve high specificity for the identification of T1D and T2D.16 28 A related study used diagnoses and medications to identify patients with conditions that are risk factors for T2D, which were in turn used to identify patients with undiagnosed diabetes.29
We identified and addressed a number of specific challenges when developing the algorithms. We created specific definitions for cases and controls to avoid confounding by the inclusion of cases with T1D and, as much as possible, controls at risk of T2D, which has not, as yet, manifested itself. In EMR, fasting status at the time of blood draw for a patient was frequently not available. For controls, we therefore treated all glucose laboratory test results as if they had been taken in the fasting state and applied the lower, fasting glucose cut-off, which resulted in lower sensitivity but higher specificity. In developing the final algorithm, a potential source of bias was recognized in that initially T2D subjects who were treated with insulin alone were excluded, although subjects with diabetes on insulin together with one of the diabetes medications listed above were eligible for inclusion. This approach would select against T2D subjects with significant pancreatic β-cell failure. Another problem presented by patients on insulin alone with an ICD-9-CM code for T2D is that some of these patients could have T1D that was misclassified as T2D because of the age of onset or other issues. To address this, we identified patients on insulin alone as cases if they had been on a T2D medication in the past, or if they had at least two visits (on different dates) with a clinician who entered T2D diagnoses (ie, in the problem list or the encounter diagnosis).
Identifying controls presented a challenge to ensure that the control group was not ‘contaminated’ with cases, which would negatively impact power in genetic studies. We operated on the principle that absence of a diagnosis, prescribed medications, laboratory results, or other data in the EMR did not necessarily correlate with true patient status, but may reflect the selective capture of data within the EMR. Particularly at tertiary care centers, some patients receive only a portion of their care at the center. To address this challenge, we required that controls have a minimal amount of data represented in the EMR. In particular, we required controls to have had glucose testing with normal results at least once and to have at least two in-person clinician encounters. Moreover, to eliminate younger patients at increased risk of T2D but in whom the disease was not manifest, potential controls with a family history of diabetes were excluded. Another potential confounder was patients with diet controlled diabetes, although our algorithm was developed with the assumption that these patients would either have an ICD-9-CM code for T2D or an abnormal laboratory test result, which would exclude them from the control group.
Lack of standardization across EMR posed a challenge for the cross-site implementation and even within a given site where different EMR were in use. As a consortium, we identified the Consolidated Health Informatics standards as the common lingua franca to achieve comparability of data across sites. For medications, we mapped medications to RxNorm codes at the generic name level as the common link between sites.30 For purposes of easier cross-institution sharing, we identified ingredient level RxNorm codes (included in the supplementary appendix, available online only) to reduce the total number of codes. We used LOINC codes specifically to define tests for glucose and HbA1c levels and ICD-9-CM codes for diagnoses. Despite these efforts, the portability of algorithms across diverse sites poses a significant challenge and our future work is focused on developing methods to scale phenotyping more broadly. For example, we noted significant differences across sites in algorithm computing time, ranging from less than 10 s at a site using an optimized commercial data warehouse to 40 h at a site sequentially extracting categories of data using statistical software on their data warehouse. To this end, we include a link to our data dictionary, sample SQL code, and a data workflow built on an open source data mining tool for other investigators to explore: https://www.mc.vanderbilt.edu/victr/dcc/projects/acc/index.php/Library_of_Phenotype_Algorithms#Type_II_Diabetes.
Our study had a number of limitations. Study sites represent institutions with a significant research focus, and this may affect how data are routinely captured within the EMR. Study sites varied in the number of years of data available in the EMR and the degree of care fragmentation. Preliminary evidence suggests that the absence of longitudinal data and fragmentation of care across sites may decrease the specificity of our algorithm. Additional studies are under way to quantify these effects in greater detail. Rates of T2D varied across sites from 1.0% of the total available biorepository to 14.8% at Mayo, compared with an approximate rate of 8% for diabetes (all types) for the general population.31 Rate differences are likely to be due to bias in sample selection for genotyping, for which only NU selected all possible T2D cases and controls for genotyping. Other sites performed the T2D case and control algorithms on their already genotyped cohorts, which were selected for genotyping based on their suitability for other phenotypes (eg, QRS duration at VU, cataracts at Marshfield, vascular disease at Mayo Clinic and dementia at Group Health Cooperative/UW). Other sources of bias include variation in biorepository recruitment (eg, Mayo Clinic's biorepository focused on patients with vascular disease, strongly associated with T2D) and variation in local coding practices.32
While the Mayo EA, VU EA, and NU AA results do not reach nominal significance, they do approach it (p=0.1177, 0.0601, and 0.0867, respectively) and all trend in the same direction as the remaining subcohorts. The most likely explanation for the VU EA and NU AA lack of significance is reduced power from relatively small sample size for a GWAS. The Mayo EA lack of significance may be due to selection bias, as these samples were not selected for genotyping based on the T2D case and control algorithm, but rather on an algorithm designed to identify cardiovascular disease phenotypes. As noted, 14.8% of this biased Mayo cohort were identified as T2D cases, a proportion substantially higher than the national population prevalence of this disease. We suspect increased co-occurrence of cardiovascular and metabolic diseases may contribute to the reduction in significance through an increased prevalence of undiagnosed T2D among the controls. Importantly, despite the failure to achieve significance for replication of TCF7L2 at individual sites, pooling samples across sites achieved highly significant results, supporting our collective approach.
In conclusion, we describe a practical approach to the identification of T2D cases and controls for GWAS using data captured in routine clinical care across five distinct EMR. To achieve the high specificity required for GWAS, we refined an algorithm over multiple iterations, and applied stringent criteria and nationally recognized coding standards to facilitate portability across different EMR. Although the overall number of cases and controls decreased with the increased specificity needed for GWAS, by generalizing the algorithm across diverse EMR we identified the large number of cases and controls needed for a well-powered GWAS, and generated the exact OR point estimate we expected from the literature. Applying this approach across a large number of institutions provides an alternative approach for generating a large cohort of T2D cases and controls to understand better the associations between genetics and expressions of disease.
Funding The eMERGE Network was initiated and funded by NHGRI, with additional funding from NIGMS through the following grants: U01-HG-004610 (Group Health Cooperative); U01-HG-004608 (Marshfield Clinic); U01-HG-04599 (Mayo Clinic); U01HG004609 (Northwestern University); U01-HG-04603 (Vanderbilt University, also serving as the Coordinating Center), and the State of Washington Life Sciences Discovery Fund award to the Northwest Institute of Medical Genetics. The Northwestern University Enterprise Data Warehouse was funded in part by a grant from the National Center for Research Resources, UL1RR025741. The genetic data are deposited in dbGaP (accession numbers phs000170, phs000188, phs000203, phs000234, phs000237).
Competing interests None.
Patient consent Obtained.
Ethics approval Ethics approval was provided by institutional review boards from all participating sites (Northwestern University, Vanderbilt University, Group Health Cooperative, Marshfield Clinic, Mayo Clinic).
Provenance and peer review Not commissioned; externally peer reviewed.