A Supervised Learning Process to Validate Online Disease Reports for Use in Predictive Models

Abstract: Pathogen distribution models that predict spatial variation in disease occurrence require data from a large number of geographic locations to generate disease risk maps. Traditionally, these data have come from public health reporting systems; however, using online reports of new infections could speed up the process dramatically. Data from both public health systems and online sources must be validated before they can be used, but no mechanisms exist to validate data from online media reports. We have developed a supervised learning process to validate geolocated disease outbreak data in a timely manner. The process uses three input features: the data source and two metrics derived from the location of each disease occurrence. The first location metric is the probability of disease occurrence at that location, based on environmental and socioeconomic factors; the second is the distance of the location inside or outside the current known disease extent. The process also uses validation scores, generated by disease experts who review a subset of the data, to build a training data set. The aim of the supervised learning process is to generate validation scores that can be used as weights in the pathogen distribution model. After analyzing the three input features and testing the performance of alternative processes, we selected a cascade of ensembles comprising logistic regressors. Parameter values for the training data subset size, number of predictors, and number of layers in the cascade were tested before the process was deployed. The final configuration was tested using data for two contrasting diseases (dengue and cholera), and 66%–79% of data points were assigned a validation score. The remaining data points are scored by the experts, and the results both inform the training data set for the next set of predictors and feed into the pathogen distribution model.
The new supervised learning process has been implemented within our live site and is being used to validate the data that our system uses to produce updated predictive disease maps on a weekly basis.
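
The cascade of ensembles described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how layers are connected or when a point is left for the experts, so the code assumes (a) each layer is an ensemble of logistic regressors trained on random subsets of the expert-scored data, (b) each later layer receives the previous layer's mean prediction as an extra feature, and (c) a point is assigned a score only when the ensemble members agree to within a threshold. All function names, the feature encoding, the `max_spread` threshold, and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, x):
    # w[0] is the bias; w[1:] are the feature weights.
    return sigmoid(w[0] + sum(wj * v for wj, v in zip(w[1:], x)))

def fit_logistic(X, y, epochs=200, lr=0.5):
    # Batch gradient descent on the log-loss; targets may lie anywhere in [0, 1].
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j, v in enumerate(xi):
                grad[j + 1] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def fit_ensemble(X, y, n_models, subset_size, rng):
    # Each predictor sees its own random subset (drawn with replacement).
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(subset_size)]
        models.append(fit_logistic([X[i] for i in idx], [y[i] for i in idx]))
    return models

def ensemble_stats(models, x):
    preds = [predict(w, x) for w in models]
    return sum(preds) / len(preds), max(preds) - min(preds)

def fit_cascade(X, y, n_layers=2, n_models=5, subset_size=30, seed=0):
    rng = random.Random(seed)
    layers, feats = [], [list(x) for x in X]
    for _ in range(n_layers):
        models = fit_ensemble(feats, y, n_models, subset_size, rng)
        layers.append(models)
        # Assumption: the next layer sees this layer's mean prediction as a feature.
        feats = [f + [ensemble_stats(models, f)[0]] for f in feats]
    return layers

def score(layers, x, max_spread=0.2):
    # Return a validation score in [0, 1], or None to refer the point to experts.
    feats = list(x)
    for models in layers:
        mean, spread = ensemble_stats(models, feats)
        if spread <= max_spread:  # assumed agreement criterion
            return mean
        feats = feats + [mean]
    return None

def make_toy_data(n=60, seed=1):
    # Hypothetical features: [trusted source flag, occurrence probability,
    # signed distance from the known extent (positive = outside)].
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.choice([0.0, 1.0]), rng.random(), rng.uniform(-1.0, 1.0)]
        X.append(x)
        y.append(1.0 if x[1] - x[2] + 0.5 * x[0] > 0.6 else 0.0)
    return X, y

X, y = make_toy_data()
layers = fit_cascade(X, y)
```

Under these assumptions, `score(layers, [1.0, 0.9, -0.8])` returns either a value in [0, 1], which could serve as a weight in a downstream model, or `None`, meaning the point would be passed to expert review and later added to the training set.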
