Anomaly pattern detection in categorical datasets

We propose a new method for detecting patterns of anomalies in categorical datasets. We assume that anomalies are generated by some underlying process that affects only a particular subset of the data. Our method consists of two steps: we first use a "local anomaly detector" to identify individual records with anomalous attribute values, and then detect patterns in which the number of anomalous records is higher than expected. Given the set of anomalies flagged by the local anomaly detector, we search over all subsets of the data defined by fixing the values of some subset of the attributes, in order to detect self-similar patterns of anomalies. The goal is to find any such subset of the test data that shows a significant increase in anomalous activity relative to the normal behavior of the system, as indicated by the training data. We test whether the number of anomalies in each subset of the test data is significantly higher than expected, and propose an efficient algorithm that performs this test over all such subsets. We show that this algorithm accurately detects anomalous patterns in real-world hospital, container shipping, and network intrusion data.
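The second step can be illustrated with a minimal sketch. It is not the efficient algorithm proposed here: it enumerates rules by brute force, and it assumes a one-sided binomial test against a smoothed training anomaly rate as a stand-in for the significance test, which the abstract does not specify. The function names (`binom_upper_tail`, `matches`, `scan_patterns`), the `max_attrs` limit, and the toy records are all hypothetical. The local anomaly detector is assumed to have already attached a boolean flag to each record, and no multiple-testing correction is applied.

```python
from itertools import combinations
from math import comb


def binom_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def matches(record, rule):
    """True if the record has every (attribute, value) pair in the rule."""
    return all(record[attr] == value for attr, value in rule)


def scan_patterns(train, test, attributes, max_attrs=2):
    """Score every rule that fixes the values of up to max_attrs attributes.

    train and test are lists of (record_dict, is_anomalous) pairs, where the
    flag is assumed to come from a separate local anomaly detector.
    Returns (p_value, rule, test_anomalies, test_size) sorted by p-value.
    """
    results = []
    for size in range(1, max_attrs + 1):
        for attrs in combinations(attributes, size):
            # Only consider value combinations that occur in the test data.
            seen = {tuple(rec[a] for a in attrs) for rec, _ in test}
            for values in sorted(seen):
                rule = tuple(zip(attrs, values))
                tr = [flag for rec, flag in train if matches(rec, rule)]
                te = [flag for rec, flag in test if matches(rec, rule)]
                # Assumption: Laplace-smoothed training rate as the baseline.
                p0 = (sum(tr) + 1) / (len(tr) + 2)
                p_value = binom_upper_tail(sum(te), len(te), p0)
                results.append((p_value, rule, sum(te), len(te)))
    results.sort(key=lambda r: r[0])
    return results


# Toy data (hypothetical): anomalies concentrate in records with port == "22".
web = {"port": "80", "proto": "tcp"}
ssh = {"port": "22", "proto": "tcp"}
train = [(web, False)] * 10 + [(ssh, False)] * 9 + [(ssh, True)]
test = [(web, False)] * 5 + [(ssh, True)] * 4 + [(ssh, False)]

ranked = scan_patterns(train, test, ["port", "proto"])
for p_value, rule, k, n in ranked[:3]:
    print(f"{dict(rule)}: {k}/{n} anomalous, p = {p_value:.4f}")
```

Because the number of candidate rules grows combinatorially with the number of attributes, a practical version would need both a more efficient search and a correction for testing many subsets at once.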
