Comparison of machine learning classifiers for influenza detection from emergency department free-text reports

Influenza recurs every year and has the potential to become pandemic, so an effective biosurveillance system is needed to detect it early. Our previous studies showed that electronic Emergency Department (ED) free-text reports can improve real-time influenza detection. This paper studies seven machine learning (ML) classifiers for influenza detection, compares their diagnostic capabilities against an expert-built influenza Bayesian classifier, and evaluates different ways of handling clinical information missing from the free-text reports. We identified 31,268 ED reports from 4 hospitals between 2008 and 2011 and formed two datasets: training (468 cases and 29,004 controls) and test (176 cases and 1,620 controls). We used Topaz, a natural language processing (NLP) tool, to extract influenza-related findings and encode each as one of three values: Acute, Non-acute, or Missing. All ML classifiers had areas under the ROC curve (AUC) ranging from 0.88 to 0.93 and performed significantly better than the expert-built Bayesian model. Explicitly assigning a value of Missing to absent clinical information (treating it as not missing at random) consistently improved performance for 3 of 4 ML classifiers compared with assigning no value (treating it as missing completely at random). The case/control ratio did not affect classification performance given the large number of training cases. Our study demonstrates that ED reports, combined with ML, NLP, and explicit handling of missing values, have great potential for the detection of infectious diseases.
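The two missing-value configurations and the AUC comparison described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `encode_findings` and `auc`, the finding names, and the dictionary representation of the NLP output are all hypothetical, chosen only to show the idea of keeping Missing as an explicit third value versus leaving the finding unassigned.

```python
from typing import Dict, List


def encode_findings(
    extracted: Dict[str, str],
    finding_names: List[str],
    explicit_missing: bool = True,
) -> Dict[str, str]:
    """Encode one report's findings as Acute / Non-acute / Missing.

    `extracted` is a hypothetical stand-in for NLP output: it maps a
    finding name to "Acute" or "Non-acute" for findings mentioned in
    the report. With explicit_missing=True, unmentioned findings get
    the value "Missing"; with False they are left out, so a classifier
    would simply ignore them.
    """
    encoded = {}
    for name in finding_names:
        if name in extracted:
            encoded[name] = extracted[name]
        elif explicit_missing:
            encoded[name] = "Missing"
    return encoded


def auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve via pairwise case/control comparison.

    Equals the probability that a randomly chosen case (label 1)
    scores higher than a randomly chosen control (label 0), with
    ties counted as one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))
```

Under these assumptions, each configuration would be encoded once, a classifier trained on each encoding, and the resulting test-set scores passed to `auc` to compare the two.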
