Fixed-Background EM Algorithm for Semi-Supervised Anomaly Detection

Aalto University, P.O. Box 11000, FI-00076 Aalto, www.aalto.fi

Authors: Tommi Vatanen, Mikael Kuusela, Eric Malmi, Tapani Raiko, Timo Aaltonen and Yoshikazu Nagai
Name of the publication: Fixed-Background EM Algorithm for Semi-Supervised Anomaly Detection
Publisher: School of Science
Unit: Department of Information and Computer Science
Series: Aalto University publication series SCIENCE + TECHNOLOGY 22/2011
Field of research: Computer science

Abstract

We study a semi-supervised anomaly detection problem where anomalies lie among the normal data. Instead of analyzing individual observations, anomalies are identified collectively based on deviations from the distribution of the normal data. We first model the normal data using a mixture of Gaussians and then use a variant of the EM algorithm to fit a mixture of the normal model and a number of additional Gaussians to an unlabeled data set. The statistical significance of the model is verified using a likelihood ratio test based on nonparametric bootstrapping.

Using artificial data, we show that the proposed methodology provides accurate models for the anomalous data and good estimates for the proportion of anomalies in the sample. We apply the method to the search for the Higgs boson in particle physics and show that it is applicable to this type of task with little a priori knowledge of the new phenomenon.
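
The fitting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one-dimensional data and a single additional Gaussian, and the names `gauss_pdf`, `background_pdf` and `fixed_background_em` are hypothetical. The background mixture is held fixed while EM updates only the anomaly proportion and the parameters of the additional Gaussian.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def background_pdf(x, weights, means, variances):
    """Density of the background mixture of Gaussians (parameters stay fixed)."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def fixed_background_em(x, bg_density, mean, var, lam=0.1, n_iter=200):
    """Fit p(x) = (1 - lam) * p_bg(x) + lam * N(x | mean, var) with p_bg fixed.

    Assumption for illustration: one anomaly Gaussian, 1-D data.
    """
    p_bg = bg_density(x)  # evaluated once, since the background never changes
    for _ in range(n_iter):
        # E-step: posterior probability that each point comes from the anomaly
        p_an = lam * gauss_pdf(x, mean, var)
        resp = p_an / (p_an + (1.0 - lam) * p_bg)
        # M-step: update only the anomaly proportion and the anomaly Gaussian
        n_an = resp.sum()
        lam = n_an / x.size
        mean = (resp * x).sum() / n_an
        var = (resp * (x - mean) ** 2).sum() / n_an + 1e-6  # small variance floor
    loglik = np.log(lam * gauss_pdf(x, mean, var) + (1.0 - lam) * p_bg).sum()
    return lam, mean, var, loglik

# Artificial data: 900 background points and 100 anomalies (illustrative values).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 900), rng.normal(4.0, 0.5, 100)])

# Background model assumed known here; in practice it is fitted to labeled normal data.
bg = lambda t: background_pdf(t, [1.0], [0.0], [1.0])

lam_hat, mean_hat, var_hat, loglik = fixed_background_em(x, bg, mean=3.0, var=1.0)

# Likelihood ratio statistic against the background-only model.
loglik_bg = np.log(bg(x)).sum()
lr_stat = 2.0 * (loglik - loglik_bg)
```

In this sketch `lam_hat` estimates the proportion of anomalies in the sample, and `lr_stat` is the quantity whose significance the abstract describes assessing by nonparametric bootstrapping. The bootstrap itself is not shown.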
