Quality assessment of real-world data repositories across the data life cycle: A literature review

OBJECTIVE Data quality (DQ) must be consistently defined in context. The attributes, metadata, and context of longitudinal real-world data (RWD) have not been formalized for quality improvement across the data production and curation life cycle. We reviewed the literature on DQ assessment frameworks, indicators, and tools for research, public health, service, and quality improvement across the data life cycle.

MATERIALS AND METHODS The review followed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. We searched databases from the health, physical, and social sciences: CINAHL, Embase, Scopus, ProQuest, Emcare, PsycINFO, Compendex, and Inspec. Embase was used instead of PubMed (an interface to search MEDLINE) because it covers all the MeSH (Medical Subject Headings) terms used and all journals in MEDLINE, as well as additional unique journals and conference abstracts. A combined data life cycle and quality framework guided the search of published and gray literature for DQ frameworks, indicators, and tools. At least 2 authors independently identified articles for inclusion and extracted and categorized DQ concepts and constructs. All authors discussed the findings iteratively until consensus was reached.

RESULTS The 120 included articles yielded concepts related to contextual (data source, custodian, and user) and technical (interoperability) factors across the data life cycle. Contextual DQ subcategories included relevance, usability, accessibility, timeliness, and trust. Well-tested computable DQ indicators and assessment tools were also found.

CONCLUSIONS A DQ assessment framework that covers intrinsic, technical, and contextual categories across the data life cycle enables assessment and management of RWD repositories to ensure fitness for purpose. Balancing security, privacy, and FAIR principles requires trust and reciprocity, transparent governance, and organizational cultures that value good documentation.
