Detecting outliers using transduction and statistical testing

Outlier detection can uncover malicious behavior in fields like intrusion detection and fraud analysis. Although there has been a significant amount of work on outlier detection, most algorithms proposed in the literature are tied to a particular definition of outliers (e.g., density-based) and use ad hoc thresholds to detect them. In this paper we present a novel technique for detecting outliers with respect to an existing clustering model; the test can also be used successfully to recognize outliers when clustering information is not available. Our method is based on Transductive Confidence Machines, which were previously proposed as a mechanism for providing individual confidence measures on classification decisions. The test uses hypothesis testing to decide whether a point fits in each of the clusters of the model. We demonstrate experimentally that the test is highly robust and misdiagnoses very few points, even when no clustering information is available. Our experiments further demonstrate that the method remains robust when the data are contaminated by outliers. Finally, we show that our technique can be successfully applied to identify outliers in a noisy data set for which no information (e.g., ground truth or clustering structure) is available. As such, our proposed methodology is capable of bootstrapping a clean data set from a noisy one, which can then be used to identify future outliers.
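
The per-cluster hypothesis test described above can be illustrated with a minimal sketch. The abstract does not specify the strangeness measure, so the code below assumes one common choice, the sum of distances to the k nearest neighbours within the cluster. The function names, the value of k, and the 0.05 significance level are illustrative assumptions, not the authors' implementation.

```python
import math

def knn_strangeness(point, others, k):
    """Assumed strangeness measure: sum of distances to the k nearest points."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k])

def cluster_p_value(point, cluster, k=3):
    """Transductive p-value for the hypothesis 'point belongs to cluster'."""
    # Tentatively add the point to the cluster, then score every member
    # against the remaining members of the extended cluster.
    extended = list(cluster) + [point]
    scores = []
    for i, x in enumerate(extended):
        rest = extended[:i] + extended[i + 1:]
        scores.append(knn_strangeness(x, rest, k))
    candidate = scores[-1]
    # Fraction of members (candidate included) at least as strange as the candidate.
    return sum(s >= candidate for s in scores) / len(scores)

def is_outlier(point, clusters, k=3, significance=0.05):
    """Flag the point only if membership is rejected for every cluster."""
    return all(cluster_p_value(point, c, k) < significance for c in clusters)

# Two small synthetic clusters laid out on 5x5 grids (illustrative data only).
cluster_a = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
cluster_b = [(5 + i * 0.1, 5 + j * 0.1) for i in range(5) for j in range(5)]

far_point_flagged = is_outlier((10.0, 10.0), [cluster_a, cluster_b])
inside_point_flagged = is_outlier((0.25, 0.25), [cluster_a, cluster_b])
```

In this sketch the smallest attainable p-value is 1/(n+1) for a cluster of n points, so a cluster must hold at least 20 points before any candidate can be rejected at the 0.05 level.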
