Knowledge and Information Systems

Class Separation through Variance: A New Application of Outlier Detection

This paper introduces a new outlier detection approach and extends the concept of class separation through variance. We show that even for balanced, concentric classes differing only in variance, accumulating information about the outlierness of points across multiple subspaces produces a ranking in which the classes naturally tend to separate. Exploiting this observation yields a highly effective and efficient unsupervised class separation approach. Unlike typical outlier detection algorithms, the method applies with great success well beyond the ‘rare classes’ case. The new algorithm, FASTOUT, introduces a number of novel features: it employs sampling of subspace points and is highly efficient; it handles arbitrarily sized subspaces and converges to an optimal subspace size through the use of an objective function. In addition, two approaches are presented for automatically deriving the class of the data points from the ranking. Experiments show that FASTOUT typically outperforms other state-of-the-art outlier detection methods on high-dimensional data, such as Feature Bagging, SOE1, LOF, ORCA, and Robust Mahalanobis Distance, and even competes with leading supervised classification methods for separating classes.
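The core idea of the abstract can be illustrated with a minimal synthetic sketch. This is not the FASTOUT algorithm itself; it is a hypothetical toy example (dimensions, subspace size, score function, and sample counts are all illustrative choices) showing how cumulative outlierness over randomly sampled subspaces separates two balanced, concentric classes that differ only in variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two balanced, concentric 50-dimensional Gaussian classes that differ
# only in variance (illustrative setup, not the paper's benchmark data).
d, n = 50, 200
low_var = rng.normal(0.0, 1.0, size=(n, d))   # class 0: tight around origin
high_var = rng.normal(0.0, 3.0, size=(n, d))  # class 1: spread around origin
X = np.vstack([low_var, high_var])
labels = np.array([0] * n + [1] * n)

# Accumulate a simple outlierness measure (distance from the subspace
# centroid) over many randomly sampled low-dimensional subspaces.
scores = np.zeros(len(X))
for _ in range(100):
    dims = rng.choice(d, size=5, replace=False)  # random 5-d subspace
    sub = X[:, dims]
    scores += np.linalg.norm(sub - sub.mean(axis=0), axis=1)

# Ranking by cumulative score separates the classes: the high-variance
# points dominate the top of the ranking.
top_half = np.argsort(scores)[-n:]
purity = (labels[top_half] == 1).mean()
print(purity)
```

In this synthetic setup the purity of the top half of the ranking is essentially 1.0, since the expected subspace distance grows with class variance and the accumulation over many subspaces sharpens the gap between the two score distributions.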
