Dimension-based subspace search for outlier detection

Scientific data are often high-dimensional. In such data, finding outliers is challenging because they are often hidden in subspaces, i.e., lower-dimensional projections of the data. Recent approaches to outlier mining decouple the actual detection of outliers from the search for subspaces likely to contain outliers. However, finding a set of subspaces that contains most or even all outliers of a given data set remains an open problem. While previous proposals use per-subspace measures such as correlation to quantify the quality of subspaces, we explicitly take the relationship between subspaces into account and propose a dimension-based measure of that quality. Based on it, we formalize the notion of an optimal set of subspaces and propose the Greedy Maximum Deviation heuristic to approximate this set. Experiments on comprehensive benchmark data show that our concept is more effective in determining the relevant set of subspaces than approaches that use per-subspace measures.
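To make the decoupling of subspace search and outlier detection concrete, the following is a minimal sketch and not the method proposed here: it assumes a set of subspaces has already been produced by some search step, and scores every point by averaging a simple k-nearest-neighbour distance over those subspaces. The function names `knn_scores` and `ensemble_scores`, the choice of detector, and the synthetic data are all illustrative assumptions. The Greedy Maximum Deviation heuristic itself is not reproduced.

```python
import numpy as np

def knn_scores(X, k=5):
    """Outlier score per point: Euclidean distance to its k-th nearest neighbour."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest neighbour.
    return np.sort(dists, axis=1)[:, k]

def ensemble_scores(X, subspaces, k=5):
    """Average per-subspace scores. `subspaces` comes from a separate search step."""
    per_subspace = [knn_scores(X[:, list(dims)], k) for dims in subspaces]
    return np.mean(per_subspace, axis=0)

# Synthetic data (assumption): dimensions 0 and 1 are strongly correlated,
# dimensions 2 and 3 are independent noise.
rng = np.random.default_rng(0)
n = 200
x0 = rng.uniform(-1.0, 1.0, size=n)
x1 = x0 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x0, x1, rng.uniform(-1.0, 1.0, size=(n, 2))])

# Plant one outlier at row 0: each coordinate is unremarkable on its own,
# but the pair (0.8, -0.8) breaks the correlation in subspace (0, 1).
outlier_idx = 0
X[outlier_idx, :2] = [0.8, -0.8]

# Hypothetical output of a subspace search step.
subspaces = [(0, 1), (2, 3)]
scores = ensemble_scores(X, subspaces)
print("highest-scoring point:", int(np.argmax(scores)))
```

In this sketch the detector never sees how the subspaces were chosen, which is the decoupling described above. The quality of the final ranking therefore depends on whether the supplied set of subspaces covers the projections in which outliers are visible.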
