Applying probabilistic temporal and multisite data quality control methods to a public health mortality registry in Spain: a systematic approach to quality control of repositories

OBJECTIVE: To assess the variability in data distributions among data sources and over time, through a case study of a large multisite repository, as a systematic approach to data quality (DQ).

MATERIALS AND METHODS: Novel probabilistic DQ control methods based on information theory and geometry were applied to the Public Health Mortality Registry of the Region of Valencia, Spain, with 512 143 entries from 2000 to 2012, disaggregated into 24 health departments. The methods provide DQ metrics and exploratory visualizations for (1) assessing the variability among multiple sources and (2) monitoring and exploring changes over time. The methods are suited to big data and to multitype, multivariate, and multimodal data.

RESULTS: The repository was partitioned into 2 probabilistically separated temporal subgroups following a change in the Spanish National Death Certificate in 2009. Isolated temporal anomalies were detected, caused by a transient increase in missing data, along with outlying and clustered health departments attributable to differences in populations or in practices.

DISCUSSION: Changes in protocols, differences in populations, biased practices, or other systematic DQ problems affected data variability. Even if semantic and integration aspects are addressed in data sharing infrastructures, probabilistic variability may still be present. Solutions include fixing or excluding data and analyzing different sites or time periods separately. A systematic approach to assessing temporal and multisite variability is proposed.

CONCLUSION: Multisite and temporal variability in data distributions affects DQ and hinders data reuse; an assessment of such variability should be part of systematic DQ procedures.
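As a rough illustration of the two assessments described above, the following sketch compares the distribution of one categorical variable across sites and across consecutive time batches. It assumes the Jensen-Shannon distance as the distribution distance, which the abstract does not specify. The function names (`js_distance`, `pairwise_distances`, `consecutive_distances`) and all counts are hypothetical toy values, not registry data and not the authors' implementation.

```python
import numpy as np


def js_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete
    distributions given as counts or probabilities; lies in [0, 1]."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins of a.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0)))


def pairwise_distances(histograms):
    """Symmetric matrix of distances between every pair of sources."""
    n = len(histograms)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = js_distance(histograms[i], histograms[j])
    return dist


def consecutive_distances(batches):
    """Distance between each time batch and the one that follows it."""
    return [js_distance(a, b) for a, b in zip(batches[:-1], batches[1:])]


# Hypothetical category counts for one variable at three sites.
site_names = ["A", "B", "C"]
site_counts = [[50, 30, 20], [48, 32, 20], [10, 20, 70]]
site_dist = pairwise_distances(site_counts)
# Mean distance of each site to the others; the largest flags an outlying site.
mean_to_others = site_dist.sum(axis=1) / (len(site_names) - 1)
outlying_site = site_names[int(np.argmax(mean_to_others))]

# Hypothetical category counts for the same variable in four time batches.
batch_counts = [[50, 30, 20], [49, 31, 20], [20, 30, 50], [21, 29, 50]]
shifts = consecutive_distances(batch_counts)
# Index of the transition with the largest distributional change.
largest_shift = int(np.argmax(shifts))

print("outlying site:", outlying_site)
print("largest shift after batch index:", largest_shift)
```

A bounded, symmetric distance is convenient in this setting because it stays defined when a category is absent at one site, and because it does not rely on significance tests whose p-values become uninformative at very large sample sizes.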
