An automated data verification approach for improving data quality in a clinical registry

BACKGROUND AND OBJECTIVE Data quality is crucial for clinical registry studies because it affects their credibility. In most such studies, a vulnerability arises because researchers record data on paper-based case report forms (CRFs) and then transcribe them into registry databases. Ensuring data quality therefore requires verifying the data in the registry. However, traditional manual data verification methods are time-consuming, labor-intensive, and of limited effectiveness. Because paper-based CRFs and electronic medical records (EMRs) are two sources for verification, we propose an automated data verification approach based on optical character recognition (OCR) and information retrieval to identify data errors in a registry more efficiently.
METHODS The automated verification approach consists of three steps. First, we analyze scanned images of paper-based CRFs with machine-learning-enhanced OCR to recognize checkbox marks and handwriting. Second, we retrieve the related patient information from the EMRs using natural language processing (NLP) techniques. Finally, we compare the information retrieved in the previous two steps with the data in the registry and synthesize the results accordingly. The proposed automated method was applied in a Chinese registry study, and the difference between the automated and manual approaches was evaluated.
RESULTS The automated approach was implemented in the Chinese Coronary Artery Disease Registry. For CRF data recognition, the recognition accuracy for checkbox marks and handwriting was 0.93 and 0.74, respectively. For EMR data extraction, the accuracy of information retrieval from textual electronic medical records was 0.97. The accuracy, recall, and time consumption of the automated approach were 0.93, 0.96, and 0.5 h, respectively, better than the corresponding values for the manual approach (0.92, 0.71, and 7.5 h).
CONCLUSIONS Compared with the manual data verification approach, the automated approach improves the recall of identifying data errors and achieves higher accuracy, while taking far less time. The results show that the automated approach is more effective and efficient for identifying incomplete and incorrect data in a registry. The proposed approach has the potential to improve the quality of registry data.
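
The third step described in METHODS, comparing values recovered from the CRFs and EMRs against the registry, can be illustrated with a minimal Python sketch. It is not the implementation used in the study. The field names, the `FieldCheck` type, the `verify_record` function, and the rule that a registry value counts as correct when it matches either source are all assumptions made for illustration; the abstract does not specify how disagreements between the two sources are resolved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FieldCheck:
    """Outcome of verifying one registry field (hypothetical structure)."""
    field: str
    status: str  # "ok", "incomplete", "incorrect", or "unverifiable"
    expected: Optional[str] = None


def normalize(value: Optional[str]) -> Optional[str]:
    """Trim and lowercase a value; treat empty strings as missing."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return cleaned or None


def verify_record(registry: Dict[str, str],
                  crf: Dict[str, str],
                  emr: Dict[str, str]) -> List[FieldCheck]:
    """Compare each registry field with the values recovered from CRF and EMR.

    Assumed rule: a registry value is accepted if it matches at least one
    source; a missing registry value with an available source value is
    flagged as incomplete; anything else is flagged as incorrect.
    """
    results = []
    for field in sorted(set(registry) | set(crf) | set(emr)):
        reg = normalize(registry.get(field))
        candidates = (normalize(crf.get(field)), normalize(emr.get(field)))
        sources = [v for v in candidates if v is not None]
        if not sources:
            status, expected = "unverifiable", None
        elif reg is None:
            status, expected = "incomplete", sources[0]
        elif reg in sources:
            status, expected = "ok", None
        else:
            status, expected = "incorrect", sources[0]
        results.append(FieldCheck(field, status, expected))
    return results


# Hypothetical record: values as stored in the registry, recognized from the
# scanned CRF, and extracted from the EMR text.
registry = {"sex": "male", "diabetes": "", "stent_count": "2"}
crf = {"sex": "Male", "diabetes": "yes", "stent_count": "3"}
emr = {"sex": "male", "stent_count": "3"}

for check in verify_record(registry, crf, emr):
    print(check.field, check.status, check.expected)
```

In this sketch the empty `diabetes` entry is reported as incomplete and the mismatched `stent_count` as incorrect, mirroring the two error types named in CONCLUSIONS. A real pipeline would also need field-specific normalization (dates, units, coded values) and a policy for cases where the CRF and EMR disagree with each other.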
