Assessing the quality of clinical and administrative data extracted from hospitals: the General Medicine Inpatient Initiative (GEMINI) experience.

OBJECTIVE Large clinical databases are increasingly used for research and quality improvement. We describe an approach to data quality assessment from the General Medicine Inpatient Initiative (GEMINI), which collects and standardizes administrative and clinical data from hospitals.

METHODS The GEMINI database contained 245 559 patient admissions at 7 hospitals in Ontario, Canada from 2010 to 2017. We performed 7 computational data quality checks and iteratively re-extracted data from hospitals to correct problems. Thereafter, GEMINI data were compared to data that were manually abstracted from each hospital's electronic medical record for 23 419 selected data points on a sample of 7488 patients.

RESULTS Computational checks flagged 103 potential data quality issues, which were either corrected or documented to inform future analysis. For example, we identified the inclusion of canceled radiology tests, a time shift of transfusion data, and mistakenly processing the chemical symbol for sodium ("Na") as a missing value. Manual validation identified 1 important data quality issue that was not detected by computational checks: transfusion dates and times at 1 site were unreliable. Apart from that single issue, across all data tables, GEMINI data had high overall accuracy (ranging from 98% to 100%), sensitivity (95% to 100%), specificity (99% to 100%), positive predictive value (93% to 100%), and negative predictive value (99% to 100%) compared to the gold standard.

DISCUSSION AND CONCLUSION Computational data quality checks with iterative re-extraction facilitated reliable data collection from hospitals but missed 1 critical quality issue. Combining computational and manual approaches may be optimal for assessing the quality of large multisite clinical databases.
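The validation metrics reported above (accuracy, sensitivity, specificity, positive and negative predictive value) can be illustrated with a minimal sketch. It assumes each data point reduces to a binary flag (for example, "test was performed") in both the extracted data and the manually abstracted gold standard. The function name `validation_metrics`, the class `ValidationMetrics`, and the sample values are hypothetical and are not taken from the GEMINI study.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ValidationMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def validation_metrics(extracted: Sequence[bool], gold: Sequence[bool]) -> ValidationMetrics:
    """Compare extracted binary flags against a manually abstracted gold standard."""
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must have the same length")

    pairs = list(zip(extracted, gold))
    tp = sum(1 for e, g in pairs if e and g)          # flagged in both sources
    fp = sum(1 for e, g in pairs if e and not g)      # flagged only in extracted data
    tn = sum(1 for e, g in pairs if not e and not g)  # absent in both sources
    fn = sum(1 for e, g in pairs if not e and g)      # missed by extracted data

    def ratio(numerator: int, denominator: int) -> float:
        # Return NaN rather than raising when a metric is undefined.
        return numerator / denominator if denominator else float("nan")

    return ValidationMetrics(
        accuracy=ratio(tp + tn, tp + fp + tn + fn),
        sensitivity=ratio(tp, tp + fn),
        specificity=ratio(tn, tn + fp),
        ppv=ratio(tp, tp + fp),
        npv=ratio(tn, tn + fn),
    )


# Hypothetical sample: five data points, one false positive in the extracted data.
extracted_flags = [True, True, False, False, True]
gold_flags = [True, False, False, False, True]
metrics = validation_metrics(extracted_flags, gold_flags)
print(metrics)
```

In a multisite setting, the same comparison would presumably be run separately per data table and per hospital, since a problem confined to one site (such as the unreliable transfusion timestamps noted above) can be hidden by pooled figures.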
