A simple and effective outlier detection algorithm for categorical data

Outlier detection is an important data mining task that has attracted substantial attention across diverse research communities and application areas. Many techniques have been developed to detect outliers, but most existing research focuses on numerical data. These techniques cannot be applied directly to categorical data because it is difficult to define a meaningful similarity measure for such data. In this paper, we first give a weighted density definition that simultaneously takes into account the density and the uncertainty of objects in every attribute. Based on this weighted density, we then propose a simple and effective outlier detection algorithm for categorical data and analyze its time complexity. Experimental results on real and synthetic data sets demonstrate the effectiveness and efficiency of the proposed algorithm.
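
The abstract does not state the weighted density formula, so the following Python sketch is only an illustration of the general idea, not the paper's definition. It assumes that the density of an object in an attribute is the relative frequency of its value, that the uncertainty of an attribute is measured by a complement-entropy-style quantity, and that attribute weights are the normalized certainties. The function names `complement_entropy` and `weighted_density` are hypothetical.

```python
from collections import Counter


def complement_entropy(column):
    """Assumed uncertainty measure: sum of p * (1 - p) over the value frequencies."""
    n = len(column)
    return sum((c / n) * (1 - c / n) for c in Counter(column).values())


def weighted_density(rows):
    """Return one score per object; lower scores suggest outlier candidates.

    Assumption: score(x) = sum over attributes of weight(a) * freq(x[a]) / n,
    where weight(a) is proportional to 1 - complement_entropy(a).
    """
    n = len(rows)
    m = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(m)]
    counts = [Counter(col) for col in columns]

    # Attributes with less uncertainty receive larger weights (an assumption).
    certainty = [1.0 - complement_entropy(col) for col in columns]
    total = sum(certainty)  # strictly positive for non-empty data
    weights = [c / total for c in certainty]

    return [
        sum(weights[j] * counts[j][row[j]] / n for j in range(m))
        for row in rows
    ]


# Toy categorical data set: the last object has rare values in both attributes.
rows = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z")]
scores = weighted_density(rows)
candidate = min(range(len(rows)), key=scores.__getitem__)
```

Under these assumptions each attribute is scanned a constant number of times, so the cost grows linearly with the number of objects times the number of attributes. Objects would then be ranked by ascending score, with the lowest-scoring ones reported as outlier candidates.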
