A new density-based subspace selection method using mutual information for high dimensional outlier detection

Abstract Outlier detection in high dimensional data faces the challenge of the curse of dimensionality, where irrelevant features may prevent the detection of outliers. In this research, we propose a novel, efficient, unsupervised density-based subspace selection method for outlier detection in the projected subspace. First, the Maximum-Relevance-to-Density (MRD) algorithm is proposed to select the relevant subspace based on mutual information. Then, applying the concept of redundancy among features, we present an efficient relevant subspace selection method called minimum-Redundancy-Maximum-Relevance-to-Density (mRMRD). Finally, the degree of outlierness of the data points in the corresponding relevant subspace is computed with the Local Outlier Factor (LOF). Experimental results on both real and synthetic data demonstrate that the proposed algorithms, based on the MRD and mRMRD criteria, increase the accuracy of outlier detection while reducing computational complexity and execution time. Moreover, as the dimensionality increases, the accuracy of outlier detection in the mRMRD-based relevant subspace is higher than in the MRD-based relevant subspace. This confirms that the proposed mRMRD-based subspace selection algorithm can efficiently select the subspace by considering the redundancy among features.
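
The abstract does not give the exact estimators, so the following is only a minimal sketch of the two selection criteria under stated assumptions, not the authors' implementation. It assumes that density is approximated by the inverse mean distance to the k nearest neighbours, that mutual information is estimated from equal-width histograms, and that mRMRD scores a candidate feature as its relevance to density minus its mean mutual information with the features already selected. All function names and parameters (`knn_density`, `select_subspace`, `bins`, `k`) are hypothetical.

```python
import math
import random
from collections import Counter


def discretize(values, bins=8):
    """Map continuous values to equal-width bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant columns fall into one bin
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(x, y, bins=8):
    """Histogram (plug-in) estimate of I(X; Y) in nats."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def knn_density(rows, k=5):
    """Assumed density proxy: inverse mean distance to the k nearest points."""
    density = []
    for i, p in enumerate(rows):
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        density.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return density


def select_subspace(rows, n_select, use_redundancy=True, bins=8, k=5):
    """Greedy feature selection.

    use_redundancy=False: rank by relevance to density only (MRD-style).
    use_redundancy=True: subtract mean MI with chosen features (mRMRD-style).
    """
    columns = list(zip(*rows))
    density = knn_density(rows, k)
    relevance = [mutual_information(c, density, bins) for c in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]

    def score(f):
        if not use_redundancy:
            return relevance[f]
        redundancy = sum(
            mutual_information(columns[f], columns[s], bins) for s in selected
        ) / len(selected)
        return relevance[f] - redundancy

    while len(selected) < n_select:
        candidates = [f for f in range(len(columns)) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (illustrative only): five Gaussian features plus a copy of feature 0.
random.seed(0)
rows = [[random.gauss(0.0, 1.0) for _ in range(5)] for _ in range(120)]
for row in rows:
    row.append(row[0])  # feature 5 is fully redundant with feature 0

mrd = select_subspace(rows, 3, use_redundancy=False)
mrmrd = select_subspace(rows, 3, use_redundancy=True)
print("MRD-style subspace:  ", mrd)
print("mRMRD-style subspace:", mrmrd)
```

The sketch stops at subspace selection. The final step described in the abstract would project the data onto the selected features and score each point with LOF, for example with an existing implementation such as scikit-learn's `LocalOutlierFactor`.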
