Medical data quality assessment: On the development of an automated framework for medical data curation

Data quality assessment has gained attention in recent years, as more and more companies and medical centers highlight the importance of an automated framework for effectively managing the quality of their big data. Data cleaning, also known as data curation, lies at the heart of data quality assessment and is a key step prior to the development of any data analytics service. In this work, we present the objectives, functionalities, and methodological advances of an automated framework for data curation from a medical perspective. The steps towards the development of a system for data quality assessment are first described, along with multidisciplinary data quality measures. A three-layer architecture that realizes these steps is then presented. Emphasis is placed on the detection and tracking of inconsistencies, missing values, outliers, and similarities, as well as on data standardization, which ultimately enables data harmonization. A case study is conducted to demonstrate the applicability and reliability of the proposed framework on two well-established cohorts with clinical data related to primary Sjögren's Syndrome (pSS). Our results confirm the validity of the proposed framework for the automated and fast identification of outliers, inconsistencies, and highly correlated and duplicated terms, as well as the successful matching of more than 85% of the pSS-related medical terms in both cohorts, yielding more accurate, relevant, and consistent clinical data.
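
To make the kinds of checks named above more concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes a single numeric column with `None` marking missing entries, uses the common interquartile-range rule as a stand-in for outlier detection, and uses Python's `difflib.SequenceMatcher` as a stand-in for lexical term matching. The function names, the `k=1.5` multiplier, and the `0.8` similarity threshold are all illustrative assumptions.

```python
from difflib import SequenceMatcher
from statistics import quantiles


def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Non-missing values outside [Q1 - k*IQR, Q3 + k*IQR].

    The IQR rule is an assumed stand-in; the paper may use other detectors.
    """
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


def best_match(term, reference_terms, threshold=0.8):
    """Most similar reference term by character similarity, or None.

    The threshold is a hypothetical cut-off chosen for illustration.
    """
    scored = [
        (SequenceMatcher(None, term.lower(), ref.lower()).ratio(), ref)
        for ref in reference_terms
    ]
    score, ref = max(scored)
    return ref if score >= threshold else None


if __name__ == "__main__":
    ages = [34, 36, 35, None, 37, 33, 120, 35]  # toy data, not from the cohorts
    print(missing_rate(ages))
    print(iqr_outliers(ages))
    print(best_match("dry eyes", ["Dry eye", "Dry mouth", "Fatigue"]))
```

A purely lexical matcher like this one would miss synonyms that share few characters, which is one reason term matching in practice is often combined with a reference ontology or vocabulary.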
