Importance of multi-modal approaches to effectively identify cataract cases from electronic health records

OBJECTIVE There is increasing interest in using electronic health records (EHRs) to identify subjects for genomic association studies, due in part to the availability of large amounts of clinical data and the expected cost efficiencies of subject identification. We describe the construction and validation of an EHR-based algorithm to identify subjects with age-related cataracts.

MATERIALS AND METHODS We used a multi-modal strategy consisting of structured database querying, natural language processing on free-text documents, and optical character recognition on scanned clinical images to identify cataract subjects and related cataract attributes. Extensive validation on 3657 subjects compared the multi-modal results to manual chart review. The algorithm was also implemented at participating electronic MEdical Records and GEnomics (eMERGE) institutions.

RESULTS An EHR-based cataract phenotyping algorithm was successfully developed and validated, resulting in positive predictive values (PPVs) >95%. The multi-modal approach increased the identification of cataract subject attributes by a factor of three compared to single-mode approaches while maintaining high PPV. Components of the cataract algorithm were successfully deployed at three other institutions with similar accuracy.

DISCUSSION A multi-modal strategy incorporating optical character recognition and natural language processing may increase the number of cases identified while maintaining similar PPVs. Such algorithms, however, require that the needed information be embedded within clinical documents.

CONCLUSION We have demonstrated that algorithms to identify and characterize cataracts can be developed utilizing data collected via the EHR. These algorithms provide a high level of accuracy even when implemented across multiple EHRs and institutional boundaries.
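
To make the validation idea concrete, the following is a minimal Python sketch of how per-subject flags from the three modes might be combined and scored against manual chart review using PPV = TP / (TP + FP). It is an illustration under stated assumptions, not the published algorithm: the `ModeResult` class, the simple "any mode fires" combination rule, and the five toy subjects are all hypothetical and do not come from the study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModeResult:
    """Hypothetical per-subject flags, one per identification mode."""
    structured: bool  # structured database query flagged the subject
    nlp: bool         # NLP on free-text documents flagged the subject
    ocr: bool         # OCR on scanned clinical images flagged the subject


def multimodal_positive(result: ModeResult) -> bool:
    # Assumed combination rule: a subject is a case if any mode flags it.
    return result.structured or result.nlp or result.ocr


def ppv(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """Positive predictive value: TP / (TP + FP), with chart review as reference."""
    tp = sum(1 for p, g in zip(predicted, reference) if p and g)
    fp = sum(1 for p, g in zip(predicted, reference) if p and not g)
    if tp + fp == 0:
        raise ValueError("no positive predictions; PPV is undefined")
    return tp / (tp + fp)


# Toy data (invented for illustration only).
subjects = [
    ModeResult(structured=True, nlp=False, ocr=False),
    ModeResult(structured=False, nlp=True, ocr=False),
    ModeResult(structured=False, nlp=False, ocr=True),
    ModeResult(structured=False, nlp=False, ocr=False),
    ModeResult(structured=False, nlp=True, ocr=False),
]
chart_review = [True, True, True, False, False]  # manual reference labels

multi = [multimodal_positive(s) for s in subjects]
structured_only = [s.structured for s in subjects]

print("multi-modal: flagged", sum(multi), "PPV", ppv(multi, chart_review))
print("structured only: flagged", sum(structured_only),
      "PPV", ppv(structured_only, chart_review))
```

In this toy set the combined rule flags more subjects than the structured query alone, at the cost of one false positive. Whether PPV holds up in practice depends on the precision of each mode, which is why the comparison against chart review matters.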
