Unsupervised Feature Selection for Outlier Detection by Modelling Hierarchical Value-Feature Couplings

Proper feature selection for unsupervised outlier detection can improve detection performance, but it is very challenging due to complex feature interactions, the mixture of relevant and noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both feature value couplings and feature couplings. Such value-to-feature couplings capture intrinsic data characteristics and distinguish relevant features from noisy/redundant ones. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a feature subset that is a 2-approximation of the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS achieves an average feature reduction rate of 48% and enables three different types of pattern-based outlier detection methods to achieve substantially higher AUC and/or run orders of magnitude faster than on the original feature set. Compared to their feature selection competitor, all three DSFS-based detectors achieve more than 20% AUC improvement on average.

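The abstract does not spell out how DSFS builds and searches its value graph, but the general recipe it names (a graph over feature values weighted by couplings, pruned with a greedy 2-approximation dense-subgraph routine, and mapped back to a feature subset) can be sketched as below. This is a loose illustration only: the co-occurrence-based edge weighting, the Charikar-style greedy peeling, and the feature-keeping rule are assumptions made for illustration rather than the actual CUFS/DSFS definitions, and all function names are hypothetical.

```python
# Illustrative sketch (not the authors' implementation) of a dense-subgraph-based
# feature selection for categorical data: build a graph whose nodes are
# (feature, value) pairs, weight edges by an assumed value-coupling score,
# greedily peel the graph to find a dense subgraph, and keep the features
# whose values survive in that subgraph.
from collections import defaultdict
from itertools import combinations

import pandas as pd


def value_coupling_graph(df: pd.DataFrame) -> dict:
    """Weight each pair of values from different features by their normalised
    co-occurrence frequency (an assumed, simplified coupling measure)."""
    n = len(df)
    weights = defaultdict(float)
    for _, row in df.iterrows():
        values = [(col, row[col]) for col in df.columns]
        for u, v in combinations(values, 2):
            weights[frozenset((u, v))] += 1.0 / n
    return weights


def greedy_densest_subgraph(nodes: set, weights: dict) -> set:
    """Charikar-style greedy peeling: repeatedly remove the node with the
    smallest weighted degree and return the densest intermediate subgraph.
    This yields at least half the optimal density (a 2-approximation)."""
    nodes = set(nodes)
    best_nodes, best_density = set(nodes), 0.0
    while nodes:
        degree = defaultdict(float)
        total_weight = 0.0
        for pair, w in weights.items():
            u, v = tuple(pair)
            if u in nodes and v in nodes:
                degree[u] += w
                degree[v] += w
                total_weight += w
        density = total_weight / len(nodes)
        if density >= best_density:
            best_density, best_nodes = density, set(nodes)
        nodes.remove(min(nodes, key=lambda x: degree.get(x, 0.0)))
    return best_nodes


def select_features(df: pd.DataFrame) -> list:
    """Keep features contributing at least one value to the dense subgraph
    (a simplified stand-in for the paper's feature-scoring step)."""
    weights = value_coupling_graph(df)
    nodes = {(col, val) for col in df.columns for val in df[col].unique()}
    dense_nodes = greedy_densest_subgraph(nodes, weights)
    return sorted({feature for feature, _ in dense_nodes})
```

The greedy peeling step is where the 2-approximation guarantee mentioned in the abstract would come from; how CUFS/DSFS actually defines value and feature couplings, and how it maps the dense subgraph back to a feature subset, is specified in the paper itself.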