Outlier Detection in Categorical Data

This chapter delves on a specific research issue connected with outlier detection problem, namely type of data attributes. More specifically, the case of analyzing data described using categorical attributes/features is presented here. It is known that the performance of a detection algorithm directly depends on the way outliers are perceived. Typically, categorical data are processed by considering the occurrence frequencies of various attributes values. Accordingly, the objective here is to characterize the deviating nature of data objects with respect to individual attributes as well as in the joint distribution of two or more attributes. This can be achieved by defining the measure of deviation in terms of the attribute value frequencies. Also, cluster analysis provides valuable insights on the inherent grouping structure of the data that helps in identifying the deviating objects. Based on this understanding, this chapter presents algorithms developed for detection of outliers in categorical data.

[1]  Anil K. Jain,et al.  Statistical Pattern Recognition: A Review , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Ke Zhang,et al.  A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data , 2009, PAKDD.

[3]  Vikramaditya Jakkula Predictive Data Mining to Learn Health Vitals of a Resident in a Smart Home , 2007 .

[4]  M. Narasimha Murty,et al.  An algorithm for mining outliers in categorical data through ranking , 2012, 2012 12th International Conference on Hybrid Intelligent Systems (HIS).

[5]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[6]  Costas S. Tzafestas,et al.  Maximum Likelihood SLAM in Dynamic Environments , 2007 .

[7]  Zengyou He,et al.  k-ANMI: A mutual information based clustering algorithm for categorical data , 2005, Inf. Fusion.

[8]  Ying Liu,et al.  Cluster-based outlier detection , 2009, Ann. Oper. Res..

[9]  QunHui Wu,et al.  Detecting outliers in sliding window over categorical data streams , 2011, 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD).

[10]  Vipin Kumar,et al.  Anomaly Detection for Discrete Sequences: A Survey , 2012, IEEE Transactions on Knowledge and Data Engineering.

[11]  Shuxin Li,et al.  Mining Distance-Based Outliers from Categorical Data , 2007 .

[12]  Santanu Santra,et al.  A genetic approach for efficient outlier detection in projected space , 2008, Pattern Recognit..

[13]  G. Athithan,et al.  Data Mining Techniques for Outlier Detection , 2011 .

[14]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[15]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD 2000.

[16]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[17]  He Zengyou,et al.  Squeezer: an efficient algorithm for clustering categorical data , 2002 .

[18]  Zengyou He,et al.  Discovering cluster-based local outliers , 2003, Pattern Recognit. Lett..

[19]  Jeff G. Schneider,et al.  Detecting anomalous records in categorical datasets , 2007, KDD '07.

[20]  Vipin Kumar,et al.  Similarity Measures for Categorical Data: A Comparative Evaluation , 2008, SDM.

[21]  Ira Assent,et al.  OutRank: ranking outliers in high dimensional data , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[22]  Anil K. Jain Data clustering: 50 years beyond K-means , 2010, Pattern Recognit. Lett..

[23]  Ayman Taha,et al.  A proposed outliers identification algorithm for categorical data sets , 2010, 2010 The 7th International Conference on Informatics and Systems (INFOS).

[24]  Michael K. Ng,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[25]  Georgios C. Anagnostopoulos,et al.  A Scalable and Efficient Outlier Detection Strategy for Categorical Data , 2007 .

[26]  Jiye Liang,et al.  A new initialization method for categorical data clustering , 2009, Expert Syst. Appl..

[27]  Victoria J. Hodge,et al.  A Survey of Outlier Detection Methodologies , 2004, Artificial Intelligence Review.

[28]  Shengrui Wang,et al.  Information-Theoretic Outlier Detection for Large-Scale Categorical Data , 2013, IEEE Transactions on Knowledge and Data Engineering.

[29]  Felix Naumann,et al.  Data fusion , 2009, CSUR.

[30]  M. Narasimha Murty,et al.  A ranking-based algorithm for detection of outliers in categorical data , 2014, Int. J. Hybrid Intell. Syst..

[31]  Zengyou He,et al.  A Fast Greedy Algorithm for Outlier Mining , 2005, PAKDD.

[32]  Takafumi Kanamori,et al.  Statistical outlier detection using direct density ratio estimation , 2011, Knowledge and Information Systems.

[33]  Tom Fawcett,et al.  An introduction to ROC analysis , 2006, Pattern Recognit. Lett..