Data Quality for Medical Data Lakelands

Medical research requires biological material and data. Medical studies based on data of unknown or questionable quality are useless or even dangerous, as evidenced by recent examples of withdrawn studies. Medical data sets consist of highly sensitive personal data, which must be protected carefully and is available for research only after approval by ethics committees. These data sets therefore cannot be stored in central data warehouses or even in a common data lake; they remain in a multitude of data lakes, which we call Data Lakelands. The collections of samples and their annotations in the European federation of biobanks (BBMRI-ERIC) are an example of such Medical Data Lakelands. We discuss the quality dimensions of data sets for medical research and the requirements for providers of data sets in terms of both the quality of meta-data and the meta-data of data quality documentation, with the aim of helping researchers identify suitable data sets for medical studies effectively and efficiently.
