Enhancing medical data quality through data curation: a case study in primary Sjögren's syndrome.

OBJECTIVES To address the need for automated assessment of clinical data quality in terms of accuracy, relevance, conformity, and completeness, by developing and applying a method that detects problematic fields and matches clinical terms within a specific domain. METHODS The proposed methodology automatically constructs three diagnostic reports that summarise the type and range of each term in the dataset, together with the detected outliers, inconsistencies, and missing values, followed by the identification of a set of clinically relevant terms based on a reference model, that is, a set of terms describing the domain knowledge of the disease of interest. RESULTS A case study was conducted on anonymised data from 250 patients diagnosed with primary Sjögren's syndrome (pSS), yielding reliable outcomes that were highlighted for clinical evaluation. The method identified 28 features with detected outliers and unknown data types, as well as missing values, similar terms, and inconsistencies within the dataset. The data standardisation method matched 76 of 85 (89.41%) pSS-related terms against a standard pSS reference model introduced by the clinicians. CONCLUSIONS The results confirm the clinical value of the data curation method for improving dataset quality, both through the precise identification of outliers, missing values, inconsistencies, and similar terms, and through the automated detection of pSS-related terms for data standardisation.
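
The two steps described under METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which outlier rule or term-similarity measure was used, so the median/MAD rule with a 3.5 threshold, the `difflib` string similarity with a 0.8 cutoff, and the names `summarise_field` and `match_terms` are all assumptions chosen for illustration.

```python
import difflib
import statistics


def summarise_field(values):
    """Build a small per-field report: missing count, types, range, outliers.

    Assumption: outliers are flagged with the modified z-score
    (median and MAD, threshold 3.5); the abstract names no specific rule.
    """
    present = [v for v in values if v is not None]
    report = {
        "missing": len(values) - len(present),
        "types": sorted({type(v).__name__ for v in present}),
        "range": None,
        "outliers": [],
    }
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    # Range and outliers are only reported for purely numeric fields.
    if numeric and len(numeric) == len(present):
        report["range"] = (min(numeric), max(numeric))
        med = statistics.median(numeric)
        mad = statistics.median([abs(v - med) for v in numeric])
        if mad > 0:
            report["outliers"] = [
                v for v in numeric if 0.6745 * abs(v - med) / mad > 3.5
            ]
    return report


def match_terms(dataset_terms, reference_terms, cutoff=0.8):
    """Map each dataset term to its closest reference term, or None.

    Assumption: plain string similarity stands in for the matching
    against the reference model; the cutoff value is hypothetical.
    """
    lowered = {t.lower(): t for t in reference_terms}
    matches = {}
    for term in dataset_terms:
        normalised = term.lower().replace("_", " ")
        close = difflib.get_close_matches(
            normalised, list(lowered), n=1, cutoff=cutoff
        )
        matches[term] = lowered[close[0]] if close else None
    return matches


if __name__ == "__main__":
    # Hypothetical field values and term names, not taken from the study.
    print(summarise_field([50, 52, 49, 51, None, 500]))
    print(match_terms(["Dry_Eyes", "age"], ["dry eyes", "salivary gland biopsy"]))
```

A median-based rule is used here because a single extreme value inflates the mean and standard deviation and can hide itself from a classical z-score test.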
