Learning Homophily Couplings from Non-IID Data for Joint Feature Selection and Noise-Resilient Outlier Detection

This paper introduces a novel wrapper-based outlier detection framework (WrapperOD) and its instance (HOUR) for identifying outliers in noisy data (i.e., data with noisy features) with strong couplings between outlying behaviors. Existing subspace or feature selection-based methods are significantly challenged by such data, as their search for feature subset(s) is independent of outlier scoring and thus can be misled by noisy features. In contrast, HOUR takes a wrapper approach to iteratively optimize the feature subset selection and outlier scoring using a top-k outlier ranking evaluation measure as its objective function. HOUR learns homophily couplings between outlying behaviors (i.e., abnormal behaviors are not independent; they bond together) in constructing a noise-resilient outlier scoring function to produce a reliable outlier ranking in each iteration. We show that HOUR (i) retains a 2-approximation outlier ranking to the optimal one; and (ii) significantly outperforms five state-of-the-art competitors on 15 real-world data sets with different noise levels in terms of AUC and/or P@n. The source code of HOUR is available at https://sites.google.com/site/gspangsite/sourcecode.
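
The two ranking-quality measures named in the abstract, AUC and precision at n, can be illustrated with a minimal sketch. This is not taken from the HOUR source code: the function names, the toy scores, and the choice of n as the number of labelled outliers (a common convention that the abstract does not specify) are all assumptions made for illustration.

```python
from typing import Sequence


def precision_at_n(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of true outliers among the n top-scored objects.

    Assumption: n is the number of labelled outliers (label == 1).
    Higher scores mean "more outlying".
    """
    n = sum(labels)
    if n == 0:
        raise ValueError("need at least one labelled outlier")
    # Indices of objects sorted by descending outlier score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in ranked[:n]) / n


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random outlier outscores a random inlier.

    Ties between an outlier and an inlier count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy ranking: five objects, two of them true outliers.
toy_scores = [0.9, 0.8, 0.3, 0.7, 0.1]
toy_labels = [1, 0, 0, 1, 0]

p_at_n = precision_at_n(toy_scores, toy_labels)
auc_value = auc(toy_scores, toy_labels)
```

In the toy data, one of the two top-ranked objects is a true outlier, and five of the six outlier-inlier pairs are ordered correctly. Both measures depend only on the ordering of the scores, which is why they suit comparisons between detectors whose raw scores are on different scales.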
