Detection of spatial and spatio-temporal clusters

This thesis develops a general and powerful statistical framework for the automatic detection of spatial and space-time clusters. Our "generalized spatial scan" framework is a flexible, model-based approach to accurate and computationally efficient cluster detection in diverse application domains. Through the development of the "fast spatial scan" algorithm and new Bayesian cluster detection methods, we can now detect clusters hundreds or thousands of times faster than previous approaches. The development of "expectation-based" scan statistics, which learn baseline models from past data and then detect regions that are anomalous given these expectations, enables more timely detection of emerging clusters with high detection power and low false positive rates. We apply these cluster detection methods to two real-world problem domains: the early detection of emerging disease epidemics and the detection of clusters of activity in fMRI brain imaging data. One major contribution of this work is the development of the SSS system for nationwide disease surveillance, currently used in daily practice by several state and local health departments. This system receives data (including emergency department records and medication sales) from over 20,000 stores and hospitals nationwide, automatically detects emerging clusters of disease, and reports these results to public health officials. Through retrospective case studies and semi-synthetic testing, we have shown that our system can detect outbreaks significantly faster than previous disease surveillance methods.
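
To make the expectation-based idea concrete, the following is a minimal sketch and not the thesis's implementation. It assumes a Poisson count model, a plain mean over past days as the baseline, and an exhaustive search over axis-aligned rectangles on a small grid. The function names (`baseline`, `score`, `best_region`) are hypothetical, and the significance testing and fast search techniques mentioned in the abstract are omitted.

```python
import math

def baseline(history):
    """Expected count per cell: mean of past daily grids (an assumed baseline model)."""
    rows, cols = len(history[0]), len(history[0][0])
    return [[sum(day[r][c] for day in history) / len(history)
             for c in range(cols)] for r in range(rows)]

def score(count, expected):
    """Poisson log-likelihood ratio for a region; zero unless count exceeds baseline."""
    if expected <= 0 or count <= expected:
        return 0.0
    return count * math.log(count / expected) + expected - count

def best_region(counts, expected):
    """Exhaustively score every rectangle (r1, c1, r2, c2) and return the top one."""
    rows, cols = len(counts), len(counts[0])
    best_score, best_rect = 0.0, None
    for r1 in range(rows):
        for r2 in range(r1, rows):
            for c1 in range(cols):
                for c2 in range(c1, cols):
                    cells = [(r, c) for r in range(r1, r2 + 1)
                             for c in range(c1, c2 + 1)]
                    total = sum(counts[r][c] for r, c in cells)
                    exp = sum(expected[r][c] for r, c in cells)
                    s = score(total, exp)
                    if s > best_score:
                        best_score, best_rect = s, (r1, c1, r2, c2)
    return best_score, best_rect

# Toy data: three past days of 10 cases per cell, then a spike in the centre cell.
history = [[[10] * 3 for _ in range(3)] for _ in range(3)]
today = [[10, 10, 10], [10, 30, 10], [10, 10, 10]]
top_score, top_rect = best_region(today, baseline(history))
print(top_rect, round(top_score, 3))
```

In this toy case the single centre cell scores highest, because enlarging the rectangle adds cells whose counts match their baselines and so dilutes the excess. A practical system would still need a way to judge whether the top score is significant, for example by randomization testing, and a much faster search than the brute-force loops used here.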
