Simulating and Evaluating Biosurveillance Datasets

Biosurveillance involves monitoring measures of diagnostic and pre-diagnostic activity for early detection of disease outbreaks. Modern biosurveillance data include daily counts of diagnostic evidence, such as lab results, and of pre-diagnostic health-seeking behavior, such as medication sales. A serious challenge to research in biosurveillance is the scarcity of authentic data available to researchers. This limits algorithm development and evaluation and hinders the comparison of methods across research groups. Since biosurveillance datasets are usually proprietary and tightly held by their owners, an alternative is to generate simulated or semi-authentic data that resemble authentic datasets. This paper describes a method for simulating multivariate biosurveillance time series, in the form of daily counts from multiple biosurveillance series, using statistics from authentic biosurveillance data. It then applies statistical tests to assess the validity of the simulated series, that is, whether they could reasonably have come from the same distribution as the authentic series. We make the simulator software and datasets publicly available.
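To make the two steps concrete, the following is a minimal sketch under stated assumptions, not the simulator described in this paper. The abstract does not specify the simulation model or the validity test, so both are stand-ins: the simulation adds jointly resampled residual rows to day-of-week means (which keeps cross-series correlation), and the validity check is a simple univariate permutation test on the difference in means. The function names `simulate_counts` and `permutation_p_value` and the toy "authentic" data are hypothetical.

```python
import random
from statistics import mean


def simulate_counts(authentic, n_days, rng):
    """Simulate daily counts from authentic rows (one row per day,
    one column per series). Assumed model, not the paper's."""
    n_series = len(authentic[0])
    # Day-of-week mean for each series (day index modulo 7).
    dow_means = []
    for d in range(7):
        rows = authentic[d::7]
        dow_means.append([mean(r[j] for r in rows) for j in range(n_series)])
    # Residual rows are kept whole so cross-series correlation is preserved.
    residuals = [
        [row[j] - dow_means[i % 7][j] for j in range(n_series)]
        for i, row in enumerate(authentic)
    ]
    simulated = []
    for t in range(n_days):
        res = rng.choice(residuals)
        # Round to integer counts and clip at zero.
        simulated.append(
            [max(0, round(dow_means[t % 7][j] + res[j])) for j in range(n_series)]
        )
    return simulated


def permutation_p_value(a, b, n_perm, rng):
    """Two-sample permutation test on the absolute difference in means."""
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


rng = random.Random(0)
# Toy stand-in for authentic data: two series with a weekday effect.
authentic = [
    [20 + 5 * (t % 7 < 5) + rng.randint(-3, 3), 10 + rng.randint(-2, 2)]
    for t in range(56)
]
simulated = simulate_counts(authentic, 28, rng)
p = permutation_p_value(
    [r[0] for r in authentic], [r[0] for r in simulated], 200, rng
)
```

A large p-value here only means the test found no evidence of a difference in means for that one series; assessing whether whole multivariate series share a distribution calls for a multivariate two-sample test.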
