What can millions of laboratory test results tell us about the temporal aspect of data quality? Study of data spanning 17 years in a clinical data warehouse

OBJECTIVE To identify common temporal evolution profiles in biological data and propose a semi-automated method to detect these patterns in a clinical data warehouse (CDW). MATERIALS AND METHODS We leveraged the CDW of the European Hospital Georges Pompidou and tracked the evolution of 192 biological parameters over a period of 17 years (more than 445,000 patients and 131 million laboratory test results). RESULTS We identified three common profiles of evolution: discretization, breakpoints, and trends. We developed computational and statistical methods to identify these profiles in the CDW. Overall, of the 192 observed biological parameters (87,814,136 values), 135 presented at least one evolution. We identified breakpoints in 30 distinct parameters, discretizations in 32, and trends in 79. DISCUSSION AND CONCLUSION Our method allowed the identification of several temporal events in the data. Considering the distribution over time of these events, we identified probable causes for the observed profiles: instrument or software upgrades and changes in computation formulas. We evaluated the potential impact for data reuse. Finally, we formulated recommendations to enable safe use and sharing of biological data collections and to limit the impact of data evolution in retrospective and federated studies (e.g., annotating laboratory parameters that present breakpoints or trends).
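To make the three profiles concrete, the following is a minimal illustrative sketch and not the method used in the study, whose details are not given in this abstract. It assumes one parameter has already been summarized as a list of per-period values (for example, monthly means). The function names `best_mean_breakpoint`, `linear_slope`, and `distinct_ratio` are hypothetical, and any decision thresholds would have to be chosen by the user.

```python
from statistics import mean


def best_mean_breakpoint(series):
    """Find the single split that most reduces the sum of squared
    deviations from the segment means (a simple mean-shift search).

    Returns (index, gain), where index is the first position of the
    second segment, or (None, 0.0) if the series is too short or flat.
    """
    n = len(series)
    if n < 4:
        return None, 0.0

    def sse(xs):
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs)

    total = sse(series)
    best_idx, best_gain = None, 0.0
    # Require at least two points in each segment.
    for k in range(2, n - 1):
        gain = total - (sse(series[:k]) + sse(series[k:]))
        if gain > best_gain:
            best_idx, best_gain = k, gain
    return best_idx, best_gain


def linear_slope(series):
    """Ordinary least-squares slope of the values against their index;
    a slope far from zero hints at a trend."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    x_bar = (n - 1) / 2
    y_bar = mean(series)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(series))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


def distinct_ratio(values):
    """Share of distinct values among raw results in one period; a drop
    between periods hints at discretization (coarser reporting)."""
    if not values:
        raise ValueError("need at least one value")
    return len(set(values)) / len(values)


if __name__ == "__main__":
    # Synthetic per-period means with an upward shift (assumed data).
    monthly_means = [5.0, 5.1, 4.9, 5.0, 5.1, 6.0, 6.1, 5.9, 6.0, 6.1]
    print(best_mean_breakpoint(monthly_means))
    print(linear_slope(monthly_means))
    print(distinct_ratio([1.2, 1.3, 1.3, 1.4]))
```

A flagged breakpoint, slope, or drop in distinct ratio is only a candidate event: as the abstract notes, confirming a cause such as an instrument upgrade or a formula change requires reviewing the timing of the event against the laboratory's history.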
