Predicting Measles Outbreaks in the United States: Evaluation of Machine Learning Approaches

Background: Measles, a highly contagious viral infection, is resurging in the United States, driven by international importation and declining domestic vaccination coverage. Despite this resurgence, measles outbreaks remain rare events that are difficult to predict. Improved methods for predicting outbreaks at the county level would help public health agencies allocate resources more effectively.

Objective: We aimed to validate and compare extreme gradient boosting (XGBoost) and logistic regression, 2 supervised learning approaches, for predicting the US counties most likely to experience measles cases. We also aimed to assess the performance of hybrid versions of these models that incorporated additional predictors generated by 2 clustering algorithms, hierarchical density-based spatial clustering of applications with noise (HDBSCAN) and unsupervised random forest (uRF).

Methods: We constructed a supervised machine learning model based on XGBoost and unsupervised models based on HDBSCAN and uRF. The unsupervised models were used to investigate clustering patterns among counties with measles outbreaks; the resulting cluster assignments were also incorporated into hybrid XGBoost models as additional input variables. The machine learning models were then compared with logistic regression models with and without input from the unsupervised models.

Results: Both HDBSCAN and uRF identified clusters that included a high percentage of counties with measles outbreaks. XGBoost and XGBoost hybrid models outperformed logistic regression and logistic regression hybrid models, with area under the receiver operating characteristic curve values of 0.920-0.926 versus 0.900-0.908, area under the precision-recall curve values of 0.522-0.532 versus 0.485-0.513, and F2 scores of 0.595-0.601 versus 0.385-0.426. Logistic regression and logistic regression hybrid models had higher sensitivity than XGBoost and XGBoost hybrid models (0.837-0.857 vs 0.704-0.735) but lower positive predictive value (0.122-0.141 vs 0.340-0.367) and specificity (0.793-0.821 vs 0.952-0.958). The hybrid versions of the logistic regression and XGBoost models had slightly higher areas under the precision-recall curve, specificity, and positive predictive values than the respective models that did not include any unsupervised features.

Conclusions: XGBoost provided more accurate predictions of measles cases at the county level than logistic regression. The prediction threshold of this model can be adjusted to align with each county’s resources, priorities, and risk for measles. Although clustering pattern data from unsupervised machine learning approaches improved some aspects of model performance in this imbalanced data set, the optimal way to integrate such approaches with supervised machine learning models requires further investigation.
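The F2 score and the adjustable prediction threshold mentioned above can be illustrated with a minimal sketch. It is not the study's code: the function names (`confusion_counts`, `f_beta`, `best_threshold`) are hypothetical, and the labels and predicted probabilities are invented toy values rather than study data. It assumes the standard definition F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP), which with beta = 2 weights sensitivity more heavily than positive predictive value.

```python
# Illustrative sketch only: toy data and hypothetical helper names,
# not the models or data used in the study.

def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting positive for score >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from counts; beta > 1 favors sensitivity (recall) over PPV."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def best_threshold(y_true, scores, beta=2.0):
    """Return the (threshold, F-beta) pair with the highest F-beta.

    Candidate thresholds are the distinct predicted scores; ties go to
    the lowest threshold.
    """
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        tp, fp, fn, _ = confusion_counts(y_true, scores, t)
        f = f_beta(tp, fp, fn, beta)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Toy example: 10 hypothetical counties, 3 of which had measles cases (label 1).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90]

best_t, best_f2 = best_threshold(y_true, scores, beta=2.0)

# Sensitivity and positive predictive value at the chosen threshold.
tp, fp, fn, tn = confusion_counts(y_true, scores, best_t)
sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp)
print(best_t, best_f2, sensitivity, ppv)
```

In practice, a health department with limited resources might prefer a higher threshold (fewer flagged counties, higher positive predictive value), whereas one prioritizing early detection might lower it; changing `beta` or selecting the threshold by a different criterion expresses that trade-off.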

[1]  World Population Prospects 2022 , 2022, Statistical Papers - United Nations (Ser. A), Population and Vital Statistics Report.

[2]  Caspar Daniel Adenutsi,et al.  Hybrid application of unsupervised and supervised learning in forecasting absolute open flow potential for shale gas reservoirs , 2021, Energy.

[3]  J. Pathak,et al.  Deep significance clustering: a novel approach for identifying risk-stratified and predictive patient subgroups , 2021, J. Am. Medical Informatics Assoc..

[4]  Ayelet Gneezy,et al.  COVID-19 and vaccine hesitancy: A longitudinal study , 2021, PloS one.

[5]  M. Kraemer,et al.  Air Passenger Travel and International Surveillance Data Predict Spatiotemporal Variation in Measles Importations to the United States , 2021, medRxiv.

[6]  Susan Hotle,et al.  The impact of COVID-19 on domestic U.S. air travel operations and commercial airport service , 2020 .

[7]  K. Khan,et al.  Persistence of US measles risk due to vaccine hesitancy and outbreaks abroad , 2020, The Lancet Infectious Diseases.

[8]  T. Wiemken,et al.  Machine Learning in Epidemiology and Health Outcomes Research. , 2020, Annual review of public health.

[9]  Ke Zhou,et al.  Cost‐efficiency disk failure prediction via threshold‐moving , 2020, Concurr. Comput. Pract. Exp..

[10]  Paul A. Gastañaduy,et al.  National update on measles cases and outbreaks — United States, January 1 – October 1, 2019 , 2020, American journal of transplantation : official journal of the American Society of Transplantation and the American Society of Transplant Surgeons.

[11]  Justin Lessler,et al.  What Is Machine Learning: a Primer for the Epidemiologist. , 2019, American journal of epidemiology.

[12]  K. Feemster,et al.  Resurgence of measles in the United States: how did we get here? , 2019, Current opinion in pediatrics.

[13]  J. Singleton,et al.  Vaccination Coverage by Age 24 Months Among Children Born in 2015 and 2016 — National Immunization Survey-Child, United States, 2016–2018 , 2019, MMWR. Morbidity and mortality weekly report.

[14]  Paul A. Gastañaduy,et al.  National Update on Measles Cases and Outbreaks — United States, January 1–October 1, 2019 , 2019, MMWR. Morbidity and mortality weekly report.

[15]  John J. Grefenstette,et al.  Forecasted Size of Measles Outbreaks Associated With Vaccination Exemptions for Schoolchildren , 2019, JAMA network open.

[16]  K. Khan,et al.  Measles resurgence in the USA: how international travel compounds vaccine resistance. , 2019, The Lancet. Infectious diseases.

[17]  M. Jackson,et al.  On the Brink: Why the U.S. is in Danger of Losing Measles Elimination Status. , 2019, Missouri medicine.

[18]  S. Redd,et al.  Increase in Measles Cases - United States, January 1-April 26, 2019. , 2019, MMWR. Morbidity and mortality weekly report.

[19]  R. Seither,et al.  Vaccination Coverage for Selected Vaccines and Exemption Rates Among Children in Kindergarten — United States, 2017–18 School Year , 2018, MMWR. Morbidity and mortality weekly report.

[20]  P. Hotez,et al.  The state of the antivaccine movement in the United States: A focused examination of nonmedical exemptions in states and counties , 2018, PLoS medicine.

[21]  Leland McInnes,et al.  UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction , 2018, ArXiv.

[22]  M. Jit,et al.  Combining serological and contact data to derive target immunity levels for achieving and maintaining measles elimination , 2017, bioRxiv.

[23]  Anshul Kundaje,et al.  Umap and Bismap: quantifying genome and methylome mappability , 2016, bioRxiv.

[24]  Michael Olusegun Akinwande,et al.  Variance Inflation Factor: As a Condition for the Inclusion of Suppressor Variable(s) in Regression Analysis , 2015 .

[25]  Saad B Omer,et al.  Vaccine hesitancy: Causes, consequences, and a call to action. , 2015, Vaccine.

[26]  J. Seward,et al.  Children and Adolescents Unvaccinated against Measles: Geographic Clustering, Parents' Beliefs, and Missed Opportunities , 2015, Public health reports.

[27]  Takaya Saito,et al.  The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets , 2015, PloS one.

[28]  Martin Kulldorff,et al.  Geographic Clusters in Underimmunization and Vaccine Refusal , 2015, Pediatrics.

[29]  Svetlana Masalovich,et al.  Vaccination Coverage Among Children in Kindergarten — United States, 2013–14 School Year , 2014, MMWR. Morbidity and mortality weekly report.

[30]  Heidi J Larson,et al.  Understanding vaccine hesitancy around vaccines and vaccination from a global perspective: a systematic review of published literature, 2007-2012. , 2014, Vaccine.

[31]  W. Bellini,et al.  Elimination of endemic measles, rubella, and congenital rubella syndrome from the Western hemisphere: the US experience. , 2014, JAMA pediatrics.

[32]  Ricardo J. G. B. Campello,et al.  Density-Based Clustering Based on Hierarchical Density Estimates , 2013, PAKDD.

[33]  Carolin Strobl,et al.  The behaviour of random forest permutation-based variable importance measures under predictor correlation , 2010, BMC Bioinformatics.

[34]  Haibo He,et al.  Learning from Imbalanced Data , 2009, IEEE Transactions on Knowledge and Data Engineering.

[35]  A. Hinman,et al.  Summary and conclusions: measles elimination meeting, 16-17 March 2000. , 2004, The Journal of infectious diseases.

[36]  D. Hartfiel,et al.  Understanding , 2003, Encyclopedia of Evolutionary Psychological Science.

[37]  A. Roux,et al.  A glossary for multilevel analysis , 2002, Journal of epidemiology and community health.

[38]  A. Parant [World population prospects]. , 1990, Futuribles.

[39]  R. Duma The National Foundation for Infectious Diseases. , 1976, The Journal of infectious diseases.

[40]  H. Bedford,et al.  Measles , 1889, BMJ.

[41]  Juan M. Corchado,et al.  A Hybrid Supervised/Unsupervised Machine Learning Approach to Classify Web Services , 2021, PAAMS.

[42]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[43]  M. Kenward,et al.  An Introduction to the Bootstrap , 2007 .

[44]  L. Breiman Random Forests , 2001, Machine Learning.

[45]  Michigan.,et al.  Toxicological profile for dichloropropenes , 2008 .