Columbia Open Health Data, clinical concept prevalence and co-occurrence from electronic health records

Columbia Open Health Data (COHD) is a publicly accessible database of electronic health record (EHR) prevalence and co-occurrence frequencies between conditions, drugs, procedures, and demographics. COHD was derived from Columbia University Irving Medical Center’s Observational Health Data Sciences and Informatics (OHDSI) database. The lifetime dataset, derived from all records, contains 36,578 single concepts (11,952 conditions, 12,334 drugs, and 10,816 procedures) and 32,788,901 concept pairs from 5,364,781 patients. The 5-year dataset, derived from records from 2013–2017, contains 29,964 single concepts (10,159 conditions, 10,264 drugs, and 8,270 procedures) and 15,927,195 concept pairs from 1,790,431 patients. Exclusion of rare concepts (count ≤ 10) and Poisson randomization enable data sharing by eliminating risks to patient privacy. EHR prevalences are informative of healthcare consumption rates. Analyzing co-occurrence frequencies through relative frequencies and observed-expected frequency ratios reveals associations between clinical concepts, which is useful for biomedical research tasks such as drug repurposing and pharmacovigilance. COHD is publicly accessible through a web application programming interface (API) and downloadable from the Figshare repository. The code is available on GitHub.

Design Type(s): source-based data analysis objective • data refinement and optimization objective • clinical history design
Measurement Type(s): electronic health record data
Technology Type(s): digital curation
Factor Type(s): temporal_interval • Concept Scheme
Sample Characteristic(s): Homo sapiens
Machine-accessible metadata file describing the reported data (ISA-Tab format)
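The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the COHD code: every function name here is hypothetical, the order of the two privacy steps (filter first, then perturb) is an assumption, and the observed-expected ratio is assumed to compare the pair count with the count expected if the two concepts were independent. Published analyses may additionally take a logarithm of that ratio.

```python
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by counting unit-rate exponential arrivals before time lam."""
    k = 0
    elapsed = rng.expovariate(1.0)
    while elapsed < lam:
        k += 1
        elapsed += rng.expovariate(1.0)
    return k


def share_safe_counts(raw_counts, rng, min_count=10):
    """Drop concepts with count <= min_count, then replace each count with a Poisson draw.

    Assumption: filtering happens before perturbation.
    """
    return {
        concept: poisson_draw(count, rng)
        for concept, count in raw_counts.items()
        if count > min_count
    }


def prevalence(count, n_patients):
    """Fraction of patients whose record contains the concept."""
    return count / n_patients


def relative_frequency(pair_count, base_count):
    """Fraction of patients with the base concept who also have the paired concept."""
    return pair_count / base_count


def observed_expected_ratio(pair_count, count_a, count_b, n_patients):
    """Observed pair count divided by the count expected under independence."""
    expected = count_a * count_b / n_patients
    return pair_count / expected


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the illustration is repeatable
    raw = {"concept_a": 5, "concept_b": 10, "concept_c": 200}  # made-up counts
    print(share_safe_counts(raw, rng))
    print(prevalence(50, 1000))
    print(relative_frequency(20, 50))
    print(observed_expected_ratio(20, 50, 100, 1000))
```

A ratio above 1 means the two concepts co-occur more often than independence would predict, and a ratio below 1 means less often. The sampler runs in time proportional to the count, which is acceptable for illustration but slow for very large counts.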
