ZERO++: Harnessing the Power of Zero Appearances to Detect Anomalies in Large-Scale Data Sets

This paper introduces ZERO++, a new unsupervised anomaly detector that uses the number of zero appearances in subspaces to detect anomalies in categorical data. It is unique in that it works in regions of subspaces that are not occupied by data, whereas existing methods work in regions that are occupied by data. ZERO++ examines only a small number of low-dimensional subspaces to identify anomalies successfully. Unlike existing frequency-based algorithms, it does not involve subspace pattern searching. We show that ZERO++ performs better than or comparably to state-of-the-art anomaly detection methods over a wide range of real-world categorical and numeric data sets, and that it is efficient, with linear time complexity and constant space complexity, which makes it a suitable candidate for large-scale data sets.
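The idea of scoring by zero appearances can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fit_zero_counts` and `zero_score`, the use of random subsamples, and the default values (50 subspaces, 2 attributes per subspace, 8 sampled rows) are assumptions chosen for illustration only. For each randomly chosen low-dimensional subspace, the sketch records which value combinations occur in a small sample; an instance's score is the number of subspaces in which its own value combination does not occur at all.

```python
import random


def fit_zero_counts(data, n_subspaces=50, subspace_dim=2, sample_size=8, seed=0):
    """Return a list of (attribute indices, observed value tuples) pairs.

    All parameter names and defaults are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    n_attrs = len(data[0])
    dim = min(subspace_dim, n_attrs)
    models = []
    for _ in range(n_subspaces):
        # Pick a random low-dimensional subspace (a small set of attributes).
        attrs = tuple(sorted(rng.sample(range(n_attrs), dim)))
        # Draw a small random sample and record which value combinations occur.
        sample = rng.sample(data, min(sample_size, len(data)))
        seen = {tuple(row[a] for a in attrs) for row in sample}
        models.append((attrs, seen))
    return models


def zero_score(row, models):
    """Count the subspaces in which the row's value combination never appears."""
    return sum(
        1 for attrs, seen in models
        if tuple(row[a] for a in attrs) not in seen
    )


if __name__ == "__main__":
    normal_rows = [("a", "x", "p")] * 100
    models = fit_zero_counts(normal_rows)
    print(zero_score(("a", "x", "p"), models))  # matches every sampled row
    print(zero_score(("b", "y", "q"), models))  # unseen in every subspace
```

In this sketch a higher score means the instance falls more often in unoccupied regions, so it is more likely to be an anomaly. Scoring one instance costs one set lookup per subspace, and the stored model depends only on the number of subspaces and the sample size, not on the size of the data set.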
