UK phenomics platform for developing and validating EHR phenotypes: CALIBER

Objective: Electronic health records (EHR) are a rich source of information on human diseases, but the information is variably structured, fragmented, curated using different coding systems and collected for purposes other than medical research. We describe an approach for developing, validating and sharing reproducible phenotypes from national structured EHR in the UK, with applications for translational research.

Materials and Methods: We implemented a rule-based phenotyping framework with up to six approaches to validation. We applied the framework to a sample of 15 million individuals in a national EHR data source (population-based primary care, all ages) linked to hospitalization and death records in England. Data comprised continuous measurements such as blood pressure, medication information, and coded diagnoses, symptoms, procedures and referrals, recorded using five controlled clinical terminologies: a) Read (primary care, subset of SNOMED-CT), b) ICD-9 and ICD-10 (secondary care diagnoses and cause of mortality), c) OPCS-4 (hospital surgical procedures) and d) Gemscript Drug Codes.

Results: The open-access CALIBER Portal (https://www.caliberresearch.org/portal) demonstrates phenotyping algorithms for 50 diseases, syndromes, biomarkers and lifestyle risk factors and provides up to six validation layers. These phenotyping algorithms have been used by 40 national and international research groups in 60 peer-reviewed publications.

Conclusion: We describe the UK EHR phenomics approach, CALIBER, with initial evidence of validity and use, as an important step towards international use of UK EHR data for health research.
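To make the idea of a rule-based phenotype over linked sources concrete, the following is a minimal sketch in Python and not the CALIBER implementation. It assumes a simplified event record, a hypothetical `CODE_LISTS` mapping with illustrative (unvalidated) codes for acute myocardial infarction, and a simple prefix-matching rule; the names `Event`, `CODE_LISTS`, `matches` and `first_diagnosis` are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Event:
    """One coded record from a linked source (hypothetical, simplified schema)."""
    patient_id: str
    source: str       # "primary_care", "hospital" or "death"
    terminology: str  # "read" or "icd10"
    code: str
    event_date: date


# Illustrative code lists keyed by (source, terminology).
# These are placeholders for demonstration, not a validated phenotype definition.
CODE_LISTS = {
    ("primary_care", "read"): {"G30.."},
    ("hospital", "icd10"): {"I21"},
    ("death", "icd10"): {"I21"},
}


def matches(event: Event) -> bool:
    """Rule: the event's code starts with any code listed for its source and terminology."""
    codes = CODE_LISTS.get((event.source, event.terminology), set())
    return any(event.code.startswith(prefix) for prefix in codes)


def first_diagnosis(events: Iterable[Event]) -> Dict[str, Tuple[date, str]]:
    """Return, per patient, the earliest matching event date and the source it came from."""
    earliest: Dict[str, Tuple[date, str]] = {}
    for event in events:
        if not matches(event):
            continue
        current = earliest.get(event.patient_id)
        if current is None or event.event_date < current[0]:
            earliest[event.patient_id] = (event.event_date, event.source)
    return earliest
```

Keeping the code lists as data, separate from the matching rule, is one way such a definition could be shared and re-run by other groups; a real phenotype would also need rules for measurements, prescriptions and conflicting records, which this sketch leaves out.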
