On Detecting Clustered Anomalies Using SCiForest

Detecting local clustered anomalies is an intricate problem for many existing anomaly detection methods. Distance-based and density-based methods are inherently restricted by their basic assumptions--anomalies are either far from normal points or being sparse. Clustered anomalies are able to avoid detection since they defy these assumptions by being dense and, in many cases, in close proximity to normal instances. In this paper, without using any density or distance measure, we propose a new method called SCiForest to detect clustered anomalies. SCiForest separates clustered anomalies from normal points effectively even when clustered anomalies are very close to normal points. It maintains the ability of existing methods to detect scattered anomalies, and it has superior time and space complexities against existing distance-based and density-based methods.

[1]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[2]  David L. Woodruff,et al.  Identification of Outliers in Multivariate Data , 1996 .

[3]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[4]  Zhi-Hua Zhou,et al.  Isolation Forest , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[5]  David Thomas,et al.  The Art in Computer Programming , 2001 .

[6]  J. H. Sheesley Tests for Outlying Observations , 1977 .

[7]  Iva Kelava Donald Ervin Knuth , 2003 .

[8]  Clara Pizzuti,et al.  Outlier mining in large high-dimensional data sets , 2005, IEEE Transactions on Knowledge and Data Engineering.

[9]  Stephen D. Bay,et al.  Mining distance-based outliers in near linear time with randomization and a simple pruning rule , 2003, KDD '03.

[10]  Pang-Ning Tan,et al.  Outlier Detection Using Random Walks , 2006, 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06).

[11]  Fabrizio Angiulli,et al.  DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets , 2009, TKDD.

[12]  Andrew W. Moore,et al.  Rule-based anomaly pattern detection for detecting disease outbreaks , 2002, AAAI/IAAI.

[13]  Simon Kasif,et al.  A System for Induction of Oblique Decision Trees , 1994, J. Artif. Intell. Res..

[14]  Felix Naumann,et al.  Data fusion , 2009, CSUR.

[15]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[16]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[17]  Raymond T. Ng,et al.  Outliers and data mining: finding exceptions in data , 2002 .

[18]  M. Schilling,et al.  Is Human Height Bimodal? , 2002 .

[19]  Tom Fawcett,et al.  Adaptive Fraud Detection , 1997, Data Mining and Knowledge Discovery.

[20]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[21]  Srinivasan Parthasarathy,et al.  Fast Distributed Outlier Detection in Mixed-Attribute Data Sets , 2006, Data Mining and Knowledge Discovery.

[22]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[23]  Graham J. Williams,et al.  On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms , 2000, KDD '00.

[24]  Vipin Kumar,et al.  Feature bagging for outlier detection , 2005, KDD '05.

[25]  Donald Ervin Knuth,et al.  The Art of Computer Programming, 2nd Ed. (Addison-Wesley Series in Computer Science and Information , 1978 .

[26]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[27]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[28]  VARUN CHANDOLA,et al.  Outlier Detection : A Survey , 2007 .