HiCS: High Contrast Subspaces for Density-Based Outlier Ranking

Outlier mining is a major task in data analysis. Outliers are objects that deviate strongly from the regular objects in their local neighborhood. Density-based outlier ranking methods score each object based on its degree of deviation. In many applications, these ranking methods degenerate to random listings due to low contrast between outliers and regular objects. Outliers do not show up in the scattered full space; instead, they are hidden in multiple high contrast subspace projections of the data. Measuring the contrast of such subspaces for outlier rankings is an open research challenge. In this work, we propose a novel subspace search method that selects high contrast subspaces for density-based outlier ranking. It is designed as a pre-processing step for outlier ranking algorithms. It searches for high contrast subspaces with a significant amount of conditional dependence among the subspace dimensions. With our approach, we propose a first measure for the contrast of subspaces. Thus, we enhance the quality of traditional outlier rankings by computing outlier scores in high contrast projections only. The evaluation on real and synthetic data shows that our approach outperforms traditional dimensionality reduction techniques, naive random projections, and state-of-the-art subspace search techniques, and that it provides enhanced quality for outlier ranking.
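
The abstract does not spell out how contrast is computed, so the following is only a minimal sketch of one way to quantify "conditional dependence among the subspace dimensions". It assumes a Monte Carlo scheme: repeatedly restrict all but one dimension of a subspace to a random slice of the data, then compare the conditional sample of the remaining dimension against its marginal sample with a two-sample Kolmogorov-Smirnov statistic. The function names `ks_statistic` and `contrast`, the parameters `iterations` and `alpha`, and the choice of deviation statistic are assumptions made for illustration, not the authors' definitions.

```python
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def contrast(data, subspace, iterations=50, alpha=0.1, rng=None):
    """Illustrative Monte Carlo contrast estimate for a subspace (assumed scheme).

    data      -- list of equal-length numeric rows
    subspace  -- list of at least two column indices
    alpha     -- assumed target fraction of objects left in each random slice
    """
    if rng is None:
        rng = random.Random(0)
    n = len(data)
    # Width of each one-dimensional condition, chosen so that roughly
    # alpha * n objects would survive all conditions if dimensions were independent.
    width = max(1, int(n * alpha ** (1.0 / (len(subspace) - 1))))
    # Object indices sorted by value along each subspace dimension.
    order = {d: sorted(range(n), key=lambda i: data[i][d]) for d in subspace}

    total = 0.0
    for _ in range(iterations):
        target = rng.choice(subspace)
        selected = set(range(n))
        for d in subspace:
            if d == target:
                continue
            start = rng.randint(0, n - width)
            selected &= set(order[d][start:start + width])
        if not selected:
            continue  # an empty slice contributes zero deviation
        marginal = [row[target] for row in data]
        conditional = [data[i][target] for i in selected]
        total += ks_statistic(marginal, conditional)
    return total / iterations


if __name__ == "__main__":
    gen = random.Random(7)
    rows = []
    for _ in range(400):
        x = gen.random()
        # column 1 depends on column 0; column 2 is independent noise
        rows.append((x, x + 0.05 * gen.gauss(0, 1), gen.random()))
    print("dependent pair  :", contrast(rows, [0, 1]))
    print("independent pair:", contrast(rows, [0, 2]))
```

Under these assumptions, a subspace whose dimensions depend on each other should receive a noticeably higher score than one made of independent dimensions, which is the property a subspace search would rank by before handing the selected projections to a density-based outlier scorer.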
