An Interoperable Similarity-based Cohort Identification Method Using the OMOP Common Data Model Version 5.0

Cohort identification for clinical studies tends to be laborious, time-consuming, and expensive. Developing automated or semi-automated methods for cohort identification is one of the “holy grails” of biomedical informatics. We propose a high-throughput, similarity-based cohort identification algorithm that applies numerical abstractions to electronic health record (EHR) data. We implement this algorithm using the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), which allows sites that use this standardized EHR data representation to adopt the algorithm with minimal local implementation effort. We validate its performance on a retrospective cohort identification task for six clinical trials conducted at the Columbia University Medical Center. Our algorithm achieves an average area under the curve (AUC) of 0.966 and an average Precision at 5 of 0.983. This interoperable method promises efficient cohort identification in EHR databases. We discuss suitable applications of our method and its limitations, and propose directions for future work.
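
The abstract reports two ranking metrics, AUC and Precision at 5, but does not say how they were computed. As a minimal sketch, assuming each candidate patient receives a similarity score and a binary label marking true cohort membership, the two metrics could be computed as below. The function names, scores, and labels are hypothetical and serve only as illustration; they are not taken from the study.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int = 5) -> float:
    """Fraction of true cohort members among the k highest-scoring patients."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random member outscores a random non-member.

    Ties between a member and a non-member count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one member and one non-member")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical similarity scores and ground-truth labels (1 = cohort member).
example_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.20]
example_labels = [1, 1, 0, 1, 1, 0, 0, 0]

print(precision_at_k(example_scores, example_labels, k=5))
print(roc_auc(example_scores, example_labels))
```

The pairwise AUC computation is quadratic in the number of patients; it is chosen here for readability, and a rank-based formulation would be preferable for large candidate pools.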
