Near-Optimal and Practical Algorithms for Graph Scan Statistics with Connectivity Constraints

One fundamental task in network analysis is detecting “hotspots” or “anomalies” in the network; that is, detecting subgraphs where there is significantly more activity than one would expect given historical data or some baseline process. Scan statistics are a popular approach to anomalous subgraph detection. The methodology involves maximizing a score function over all connected subgraphs, which is a challenging computational problem. A number of heuristics have been proposed for this problem, but they do not provide any quality guarantees. Here, we propose a framework for designing algorithms that optimize a large class of scan statistics for networks, subject to connectivity constraints. Our algorithms run in time that scales linearly with the size of the graph and depends on a parameter we call the “effective solution size,” while providing rigorous approximation guarantees. In contrast, most prior methods have running times that are super-linear in the graph size. Extensive empirical evidence demonstrates the effectiveness and efficiency of our proposed algorithms in comparison with state-of-the-art methods. Our approach improves on the performance of all prior methods, increasing the score by more than 25% in some cases. Further, our algorithms scale to networks with up to a million nodes, which is one to two orders of magnitude larger than all prior applications.
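To make the optimization problem concrete, the following is a minimal brute-force sketch of maximizing a score over connected subgraphs of a tiny graph. It is not the algorithm proposed here: it enumerates every node subset up to a size limit, so it is exponential and only illustrates the objective. The score function (total node weight divided by the square root of the subgraph size), the example graph, the weights, and all function names are assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def is_connected(nodes, adj):
    """Return True if the induced subgraph on `nodes` is connected (DFS)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def score(nodes, weight):
    """Hypothetical scan statistic: total weight normalized by sqrt(size)."""
    return sum(weight[v] for v in nodes) / sqrt(len(nodes))


def best_connected_subgraph(adj, weight, max_size):
    """Exhaustively search connected node subsets of size <= max_size."""
    best_nodes, best_score = None, float("-inf")
    for k in range(1, max_size + 1):
        for subset in combinations(sorted(adj), k):
            if is_connected(subset, adj):
                s = score(subset, weight)
                if s > best_score:
                    best_nodes, best_score = set(subset), s
    return best_nodes, best_score


# Assumed toy input: a path graph 0-1-2-3-4 with per-node activity weights.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
weight = {0: 0.1, 1: 2.0, 2: 1.5, 3: -0.5, 4: 1.8}

nodes, s = best_connected_subgraph(adj, weight, max_size=3)
print(nodes, round(s, 4))
```

In this toy input, nodes 1 and 4 together would score higher than any connected pair, but they are not adjacent, so the connectivity constraint excludes them. The exhaustive enumeration is what makes the general problem hard at scale.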
