Nonparametric Detection of Anomalous Data via Kernel Mean Embedding

An anomaly detection problem is investigated in which, among n sequences in total, s anomalous sequences are to be detected. Each normal sequence contains m independent and identically distributed (i.i.d.) samples drawn from a distribution p, whereas each anomalous sequence contains m i.i.d. samples drawn from a distribution q that is distinct from p. The distributions p and q are assumed to be unknown a priori. Two scenarios are studied: one with and one without a reference sequence generated by p. Distribution-free tests are constructed using the maximum mean discrepancy (MMD) as the metric, which is based on mean embeddings of distributions into a reproducing kernel Hilbert space (RKHS). For both scenarios, it is shown that as the number n of sequences goes to infinity, if the value of s is known, then the number m of samples in each sequence should be of order O(log n) or larger for the developed tests to consistently detect the s anomalous sequences. If the value of s is unknown, then m should be of order strictly larger than O(log n). The computational complexity of all developed tests is shown to be polynomial. Numerical results demonstrate that the proposed tests outperform, or perform as well as, tests based on other competitive traditional statistical approaches and kernel-based approaches in various cases. Consistency of the proposed test is also demonstrated on a real data set.
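As a rough illustration of the scenario with a reference sequence and known s, the following is a minimal sketch and not necessarily the exact test developed in the paper. It assumes scalar samples, a Gaussian kernel with a hand-picked bandwidth, and the biased (V-statistic) estimate of squared MMD. The function names (`gaussian_kernel`, `mmd2_biased`, `detect_anomalies`) and the synthetic data (p = N(0, 1), q = N(2, 1)) are hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between 1-D sample arrays x and y."""
    diff = x[:, None] - y[None, :]
    return np.exp(-diff**2 / (2.0 * bandwidth**2))

def mmd2_biased(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared MMD between x and y."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

def detect_anomalies(sequences, reference, s, bandwidth=1.0):
    """Return the indices of the s sequences farthest (in MMD) from the reference."""
    scores = np.array([mmd2_biased(seq, reference, bandwidth) for seq in sequences])
    return set(np.argsort(scores)[-s:].tolist())

# Synthetic demo (assumed setup): n = 20 sequences of m = 200 samples each,
# with s = 2 anomalous sequences at indices 3 and 11.
rng = np.random.default_rng(0)
n, m, s = 20, 200, 2
anomalous = {3, 11}
sequences = [
    rng.normal(2.0 if i in anomalous else 0.0, 1.0, size=m) for i in range(n)
]
reference = rng.normal(0.0, 1.0, size=m)  # reference sequence drawn from p

detected = detect_anomalies(sequences, reference, s)
print(sorted(detected))
```

Each MMD estimate here costs O(m^2) kernel evaluations, so scoring all n sequences against the reference costs O(n m^2), which is in line with the polynomial complexity stated above. When s is unknown, a threshold on the scores would be needed instead of a top-s selection; that variant is not shown here.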
