Improvement of the quality of medical databases: data-mining-based prediction of diagnostic codes from previous patient codes

INTRODUCTION Diagnoses and medical procedures collected under the French medical information system are recorded in a nationwide database, the "PMSI national database", which is available for secondary analysis. The quality of the data in this database depends directly on the quality of coding, which can be poor. Among the methods proposed for exploiting health databases, data mining techniques are of particular interest. Our objective is to build sequential rules that predict missing diagnoses by mining the PMSI national database.

METHOD Our working sample was constructed from the national database for the years 2007 to 2010. The information retained for rule construction consisted of medical diagnoses and medical procedures. Rules were selected using a statistical filter, and the selected rules were validated by case review based on medical letters, which allowed us to estimate the improvement in diagnosis recoding.

RESULTS The working sample comprised 59,170 inpatient stays. The predicted ICD codes were E11 (non-insulin-dependent diabetes mellitus), I48 (atrial fibrillation and flutter) and I50 (heart failure). We validated three sequential rules with a substantial improvement in positive predictive value:

{E11,I10,DZQM006}=>{E11}
{E11,I10,I48}=>{E11}
{I48,I69}=>{I48}

DISCUSSION Using data mining, we were able to extract three simple, reliable and effective sequential rules that substantially improve diagnosis recoding. These results indicate that data mining methods offer an opportunity to improve the data quality of the national database.
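To make the three validated rules concrete, the following is a minimal Python sketch of how such rules might be applied to flag a possibly missing diagnosis, and how a positive predictive value could be computed from case-review outcomes. It is an illustration only, not the authors' implementation. It assumes that a rule fires when all antecedent codes appear among a patient's previous codes (treated here as an unordered set) and the predicted code is absent from the current stay. The function names `predict_missing_codes` and `positive_predictive_value` are hypothetical.

```python
# Illustrative sketch only. Assumption: a rule fires when every antecedent
# code appears in the patient's previous codes and the predicted code is
# not already recorded for the current stay.

# The three validated rules from the abstract: antecedent => predicted code.
RULES = [
    (frozenset({"E11", "I10", "DZQM006"}), "E11"),
    (frozenset({"E11", "I10", "I48"}), "E11"),
    (frozenset({"I48", "I69"}), "I48"),
]


def predict_missing_codes(previous_codes, current_codes):
    """Return the set of codes the rules suggest are missing from the current stay."""
    previous = set(previous_codes)
    current = set(current_codes)
    return {
        consequent
        for antecedent, consequent in RULES
        if antecedent <= previous and consequent not in current
    }


def positive_predictive_value(review_outcomes):
    """PPV = confirmed predictions / all predictions.

    review_outcomes holds one boolean per prediction: True if case review
    confirmed the predicted diagnosis. Returns None when there is no prediction.
    """
    outcomes = list(review_outcomes)
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


if __name__ == "__main__":
    # Hypothetical patient: earlier stays carried E11, I10 and DZQM006,
    # but the current stay records only I10.
    print(predict_missing_codes({"E11", "I10", "DZQM006"}, {"I10"}))
    # Hypothetical review outcomes for four predictions, three confirmed.
    print(positive_predictive_value([True, True, False, True]))
```

In this sketch a flagged code is only a candidate for review; in the study, confirmation came from case review based on medical letters.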
