Outliers in High Dimensional Data

This chapter addresses one of the research issues connected with the outlier detection problem, namely the dimensionality of the data. More specifically, the focus is on detecting outliers embedded in subspaces of high-dimensional categorical data. To this end, several algorithms for unsupervised selection of feature subsets in the categorical data domain are presented, together with a detailed discussion on devising suitable measures for assessing the relevance and redundancy of categorical attributes/features. An experimental study on benchmark categorical data sets demonstrates the efficacy of these algorithms for outlier detection.
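The relevance and redundancy measures mentioned above are commonly built on mutual information between categorical attributes. As an illustrative sketch only (the function names and the greedy scheme are assumptions, not the chapter's exact algorithm), an unsupervised max-relevance/min-redundancy selector over categorical columns might look like this:

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """I(X;Y) in bits for two equal-length sequences of category labels."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2((c * n) / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def greedy_mrmr_unsupervised(columns, k):
    """Greedily pick k feature indices. With no class label available,
    relevance is proxied by a feature's mean MI with all other features,
    and redundancy by its mean MI with the features already selected."""
    d = len(columns)
    relevance = [
        sum(mutual_information(columns[i], columns[j])
            for j in range(d) if j != i) / (d - 1)
        for i in range(d)
    ]
    # start from the most relevant feature
    selected = [max(range(d), key=lambda i: relevance[i])]
    while len(selected) < k:
        def score(i):
            redundancy = sum(mutual_information(columns[i], columns[s])
                             for s in selected) / len(selected)
            return relevance[i] - redundancy
        remaining = [i for i in range(d) if i not in selected]
        selected.append(max(remaining, key=score))
    return selected
```

For example, given a column, an exact duplicate of it, and an independent column, the selector keeps the original and the independent column while skipping the duplicate, since the duplicate's redundancy penalty cancels its relevance.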
