Statistical Analysis of Nearest Neighbor Methods for Anomaly Detection

Nearest-neighbor (NN) procedures are well studied and widely used in both supervised and unsupervised learning. This paper investigates the performance of NN-based methods for anomaly detection. We first show through extensive simulations on a set of synthetic benchmark datasets that NN methods compare favorably with other state-of-the-art anomaly detection algorithms. We then consider the performance of NN methods on real datasets and relate it to the dimensionality of the problem. Next, we analyze the theoretical properties of NN methods for anomaly detection by studying a more general quantity called the distance-to-measure (DTM), originally developed in the literature on robust geometric and topological inference. We provide finite-sample uniform guarantees for the empirical DTM and use them to derive misclassification rates for anomalous observations under various settings. Our analysis relies on Huber's contamination model and on mild geometric regularity assumptions on the underlying distribution of the data.
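To make the central quantity concrete, the sketch below scores points by an empirical DTM and draws data from a Huber-style contamination mixture. It is an illustration under stated assumptions, not the paper's implementation: it uses the standard empirical DTM definition (the root mean of the squared distances to the k = ceil(m n) nearest sample points), and the function names, the Gaussian inlier distribution, the uniform anomaly distribution, and all parameter values are hypothetical choices made only for this example.

```python
import math
import random


def empirical_dtm(query, sample, m):
    """Empirical distance-to-measure of `query` at mass level m in (0, 1].

    Assumes the standard definition: the root mean of the squared
    Euclidean distances from `query` to its k = ceil(m * n) nearest
    points in `sample`.
    """
    n = len(sample)
    k = max(1, math.ceil(m * n))
    sq_dists = sorted(
        sum((q - s) ** 2 for q, s in zip(query, point)) for point in sample
    )
    return math.sqrt(sum(sq_dists[:k]) / k)


def huber_sample(n, eps, rng):
    """Draw n points in R^2 from (1 - eps) * P0 + eps * Q.

    P0 (standard Gaussian) and Q (uniform on [-6, 6]^2) are
    illustrative assumptions, not distributions taken from the paper.
    """
    points, is_anomaly = [], []
    for _ in range(n):
        if rng.random() < eps:
            points.append((rng.uniform(-6, 6), rng.uniform(-6, 6)))
            is_anomaly.append(True)
        else:
            points.append((rng.gauss(0, 1), rng.gauss(0, 1)))
            is_anomaly.append(False)
    return points, is_anomaly


if __name__ == "__main__":
    rng = random.Random(0)
    points, is_anomaly = huber_sample(n=300, eps=0.1, rng=rng)
    # Score every point against the full sample; larger means more anomalous.
    # Each point counts itself as a neighbour at distance 0 in this sketch.
    scores = [empirical_dtm(p, points, m=0.05) for p in points]
    ranked = sorted(zip(scores, is_anomaly), reverse=True)
    top = ranked[:30]
    print("anomalies among the 30 highest scores:", sum(a for _, a in top))
```

With k = 1 the score reduces to the plain nearest-neighbor distance, which is one way to see the DTM as a generalization of NN-based anomaly scores; averaging over k neighbors makes the score less sensitive to a few stray points. This brute-force version costs O(n log n) per query and is meant only for small samples.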
