Evaluating EHR Data Availability for Cohort Selection in Retrospective Studies

Electronic health record (EHR) systems store longitudinal medical data that enable retrospective studies in healthcare. An ideal EHR would clearly designate when each patient's data start and end, maximizing the utility of those data in retrospective studies. In practice, EHR implementations differ and patients receive treatment across many healthcare systems, so the period over which longitudinal data are reliable is often unclear, and retrospective studies end up including suboptimal data. To better identify reliable EHR data for retrospective studies, we built metrics that restrict and weight data based on availability. Our metrics measure the rise and persistence of three data types in an EHR: billing codes, medication events, and tumor registry diagnoses. We implemented these metrics in a generalized cohort creation heuristic that selects cohorts with reliable data. We applied the heuristic to select a cohort of stage I-III breast cancer patients at Vanderbilt University Medical Center (VUMC) for a retrospective study of five-year adherence to adjuvant endocrine therapy. Recent clinical trials report five-year adherence of 85%, but studies in the general patient population report lower rates. With our heuristic, we determined a five-year adherence rate bounded between 55% and 78%.
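
The abstract does not give the definitions of the rise and persistence metrics, so the following Python is only an illustrative sketch of availability-based cohort restriction, not the authors' heuristic. It assumes that "rise" can be approximated by a patient's first recorded event of a data type and "persistence" by the last one, and it keeps a patient only if every data type starts near the diagnosis date and continues through the follow-up period. The function names, the 90-day grace period, and the sample records are all hypothetical.

```python
from datetime import date, timedelta

def add_years(d, years):
    """Shift a date forward by whole years, mapping Feb 29 to Feb 28 if needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def availability_window(event_dates):
    """Return (first, last) event date for one data type, or None if there are no events."""
    return (min(event_dates), max(event_dates)) if event_dates else None

def is_covered(events_by_type, index_date, years=5, grace_days=90):
    """Assumed rule: each data type must begin within `grace_days` of the index
    (diagnosis) date and still be observed `years` after it."""
    latest_start = index_date + timedelta(days=grace_days)
    end_needed = add_years(index_date, years)
    for dates in events_by_type.values():
        window = availability_window(dates)
        if window is None:
            return False
        first, last = window
        if first > latest_start or last < end_needed:
            return False
    return True

def select_cohort(patients, years=5):
    """Keep the IDs of patients whose data cover the full follow-up period."""
    return [pid for pid, (index_date, events) in patients.items()
            if is_covered(events, index_date, years)]

# Hypothetical records: patient ID -> (diagnosis date, event dates per data type)
patients = {
    "p1": (date(2005, 3, 1), {
        "billing": [date(2004, 12, 1), date(2010, 6, 1)],
        "medication": [date(2005, 4, 1), date(2010, 4, 15)],
    }),
    "p2": (date(2005, 3, 1), {
        "billing": [date(2005, 1, 10), date(2007, 2, 1)],
        "medication": [date(2005, 4, 1), date(2006, 12, 1)],
    }),
}

cohort = select_cohort(patients)
```

Here `p2` is excluded because its records stop well before the five-year mark, which is the kind of case where missing data could otherwise be mistaken for non-adherence.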
