Sparse Modeling-Based Sequential Ensemble Learning for Effective Outlier Detection in High-Dimensional Numeric Data

The large proportion of irrelevant or noisy features in real-life high-dimensional data presents a significant challenge to subspace/feature selection-based high-dimensional outlier detection (a.k.a. outlier scoring) methods. These methods often perform the two dependent tasks, relevant feature subset search and outlier scoring, independently, consequently retaining features/subspaces that are irrelevant to the scoring method and degrading detection performance. This paper introduces a novel sequential ensemble-based framework SEMSE and its instance CINFO to address this issue. SEMSE learns sequential ensembles that mutually refine feature selection and outlier scoring through iterative sparse modeling with outlier scores as the pseudo target feature. CINFO instantiates SEMSE using three successive recurrent components to build such sequential ensembles. Given outlier scores output by an existing outlier scoring method on a feature subset, CINFO first applies a Cantelli's inequality-based outlier thresholding function to select outlier candidates with a false positive upper bound. It then performs lasso-based sparse regression on the outlier candidate set, treating the outlier scores as the target feature and the original features as predictors, to obtain a feature subset tailored to the outlier scoring method. Our experiments show that two different outlier scoring methods enabled by CINFO (i) perform significantly better on 11 real-life high-dimensional data sets, and (ii) are much more resilient to noisy features, compared to their bare versions and three state-of-the-art competitors. The source code of CINFO is available at https://sites.google.com/site/gspangsite/sourcecode.
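To make the score-threshold-regress loop concrete, below is a minimal Python sketch of the general idea, not the authors' implementation. It assumes a generic outlier scorer `score_fn`, and the names `cantelli_threshold`, `sequential_ensemble`, and the parameters `fp_bound`, `alpha`, and `n_iter` are illustrative. The threshold uses Cantelli's inequality, P(S - mu >= k*sigma) <= 1/(1 + k^2), to cap the false positive rate; the lasso step keeps only features with nonzero coefficients when regressing scores on the original features over the candidate set.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cantelli_threshold(scores, fp_bound=0.1):
    """Select outlier candidates whose scores exceed mean + k*std,
    where Cantelli's inequality P(S - mu >= k*sigma) <= 1/(1+k^2)
    bounds the false positive rate by fp_bound."""
    k = np.sqrt(1.0 / fp_bound - 1.0)           # solve 1/(1+k^2) = fp_bound
    threshold = scores.mean() + k * scores.std()
    return np.where(scores >= threshold)[0]

def sequential_ensemble(X, score_fn, n_iter=5, fp_bound=0.1, alpha=0.01):
    """Illustrative sequential ensemble: score on the current feature subset,
    threshold to get outlier candidates, then lasso-select features by
    regressing the scores (pseudo target) on the original features."""
    selected = np.arange(X.shape[1])            # start with all features
    for _ in range(n_iter):
        scores = score_fn(X[:, selected])       # outlier scores as pseudo target
        cand = cantelli_threshold(scores, fp_bound)
        if len(cand) < 2:
            break
        lasso = Lasso(alpha=alpha).fit(X[cand][:, selected], scores[cand])
        keep = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)
        if len(keep) == 0:
            break
        selected = selected[keep]               # refined, scorer-tailored subset
    return score_fn(X[:, selected]), selected
```

In this sketch, the loop terminates early if no candidates or no features survive; in practice one would also aggregate the scores produced across iterations to form the ensemble output.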
