Importance of multi-modal approaches to effectively identify cataract cases from electronic health records
- Peggy L Peissig (1),
- Luke V Rasmussen (1, 2),
- Richard L Berg (1),
- James G Linneman (1),
- Catherine A McCarty (3, 4),
- Carol Waudby (3),
- Lin Chen (5),
- Joshua C Denny (6, 7),
- Russell A Wilke (8),
- Jyotishman Pathak (9),
- David Carrell (10),
- Abel N Kho (11),
- Justin B Starren (2)
- (1) Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA
- (2) Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- (3) Center for Human Genetics, Marshfield Clinic Research Foundation, Marshfield, Wisconsin, USA
- (4) Essentia Institute of Rural Health, Duluth, Minnesota, USA
- (5) Department of Ophthalmology, Marshfield Clinic, Marshfield, Wisconsin, USA
- (6) Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- (7) Department of Medicine, Vanderbilt University School of Medicine, Nashville, Tennessee, USA
- (8) Department of Medicine, Vanderbilt University, Nashville, Tennessee, USA
- (9) Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- (10) Group Health Research Institute, Seattle, Washington, USA
- (11) Departments of Medicine and Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Correspondence to Peggy L Peissig, Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, 1000 N. Oak Avenue (MLR), Marshfield, WI 54449, USA; peissig.peggy{at}marshfieldclinic.org
- Contributors: PP prepared the initial draft of the paper. RB carried out the statistical analyses. JL developed the electronic algorithm to identify cases/controls and created the databases and data sets. CW completed the data abstraction. LR configured and executed the NLP and OCR programs. PP and JS oversaw the informatics components of the study. LC, RW, and CMcC were the content experts and provided training for data abstraction. CMcC was Principal Investigator and is responsible for the conception, design, and analysis plan. JD, JP, DC, and AK oversaw informatics activities at other eMERGE institutions. All authors read and approved the final manuscript.
- Received 30 June 2011
- Accepted 10 November 2011
Abstract
Objective: There is increasing interest in using electronic health records (EHRs) to identify subjects for genomic association studies, due in part to the availability of large amounts of clinical data and the expected cost efficiencies of subject identification. We describe the construction and validation of an EHR-based algorithm to identify subjects with age-related cataracts.
Materials and methods: We used a multi-modal strategy consisting of structured database querying, natural language processing on free-text documents, and optical character recognition on scanned clinical images to identify cataract subjects and related cataract attributes. Extensive validation on 3657 subjects compared the multi-modal results to manual chart review. The algorithm was also implemented at participating electronic MEdical Records and GEnomics (eMERGE) institutions.
Results: An EHR-based cataract phenotyping algorithm was successfully developed and validated, resulting in positive predictive values (PPVs) >95%. The multi-modal approach increased the identification of cataract subject attributes by a factor of three compared to single-mode approaches while maintaining high PPV. Components of the cataract algorithm were successfully deployed at three other institutions with similar accuracy.
Discussion: A multi-modal strategy incorporating optical character recognition and natural language processing may increase the number of cases identified while maintaining similar PPVs. Such algorithms, however, require that the needed information be embedded within clinical documents.
Conclusion: We have demonstrated that algorithms to identify and characterize cataracts can be developed using data collected via the EHR. These algorithms provide a high level of accuracy even when implemented across multiple EHRs and institutional boundaries.
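To make the two ideas in the abstract concrete, the following is a minimal sketch of how evidence from several modes might be pooled per subject and how a positive predictive value is computed against chart review. It is an illustration only, not the study's algorithm: the `Evidence` record, the function names, and the toy subject IDs are all hypothetical, and the real algorithm applies clinical logic that the abstract does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One hypothetical finding that a mode produced for a subject."""
    subject_id: str
    mode: str  # assumed labels: "structured", "nlp", or "ocr"


def combine_modes(evidence):
    """Pool findings across modes: {subject_id: set of modes that flagged it}."""
    flagged = {}
    for item in evidence:
        flagged.setdefault(item.subject_id, set()).add(item.mode)
    return flagged


def positive_predictive_value(predicted, confirmed):
    """PPV = true positives / all subjects the algorithm identified."""
    predicted = set(predicted)
    if not predicted:
        raise ValueError("PPV is undefined when no subjects were identified")
    true_positives = predicted & set(confirmed)
    return len(true_positives) / len(predicted)


# Toy data (invented for illustration, not taken from the study).
evidence = [
    Evidence("s1", "structured"),
    Evidence("s2", "nlp"),
    Evidence("s2", "ocr"),
    Evidence("s3", "ocr"),
    Evidence("s4", "nlp"),
]
flagged = combine_modes(evidence)

# Subjects that a manual chart review confirmed as cases (also invented).
chart_review_confirmed = {"s1", "s2", "s3"}
ppv = positive_predictive_value(flagged, chart_review_confirmed)
```

In this toy set, a structured query alone would find only `s1`; pooling NLP and OCR findings adds further subjects, and the PPV check against chart review shows whether the added subjects are true cases.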
- Cataract
- electronic health record
- intelligent character recognition
- natural language processing
- phenotyping
- bioinformatics
- NLP
- information systems
- software engineering
- clinical research informatics
- natural-language processing
- linking the genotype and phenotype
- improving the education and skills training of health professionals
- translational research
- application of biological knowledge to clinical care
- genomics
- pharmacogenomics
- genome wide association studies
- clinical phenotyping
- medical informatics
- infection control
Footnotes
- Funding: The eMERGE Network was initiated and funded by NHGRI, with additional funding from NIGMS through the following grants: U01-HG-004610 (Group Health Cooperative), U01-HG-004608 (Marshfield Clinic), U01-HG-04599 (Mayo Clinic), U01-HG-004609 (Northwestern University), U01-HG-04603 (Vanderbilt University, also serving as the Coordinating Center); and the State of Washington Life Sciences Discovery Fund award to the Northwest Institute of Genetic Medicine.
- Competing interests: None.
- Ethics approval: Ethics approval was provided by Marshfield Clinic Institutional Review Board.
- Provenance and peer review: Not commissioned; externally peer reviewed.









