An algorithm for mining outliers in categorical data through ranking

The rapid growth in the field of data mining has lead to the development of various methods for outlier detection. Though detection of outliers has been well explored in the context of numerical data, dealing with categorical data is still evolving. In this paper, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. In the first phase, this algorithm explores a clustering of the given data, followed by the ranking phase for determining the set of most likely outliers. The proposed algorithm is expected to perform better as it can identify different types of outliers, employing two independent ranking schemes based on the attribute value frequencies and the inherent clustering structure in the given data. Unlike some existing methods, the computational complexity of this algorithm is not affected by the number of outliers to be detected. The efficacy of this algorithm is demonstrated through experiments on various public domain categorical data sets.

[1]  Michael K. Ng,et al.  On the Impact of Dissimilarity Measure in k-Modes Clustering Algorithm , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Zengyou He,et al.  A Fast Greedy Algorithm for Outlier Mining , 2005, PAKDD.

[3]  Takafumi Kanamori,et al.  Statistical outlier detection using direct density ratio estimation , 2011, Knowledge and Information Systems.

[4]  Sheau-Dong Lang,et al.  Mining Distance-Based Outliers from Categorical Data , 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[5]  Hans-Hermann Bock The Classical Data Situation , 2000 .

[6]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[7]  Victoria J. Hodge,et al.  A Survey of Outlier Detection Methodologies , 2004, Artificial Intelligence Review.

[8]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[9]  Georgios C. Anagnostopoulos,et al.  A Scalable and Efficient Outlier Detection Strategy for Categorical Data , 2007, 19th IEEE International Conference on Tools with Artificial Intelligence(ICTAI 2007).

[10]  Vikramaditya Jakkula Predictive Data Mining to Learn Health Vitals of a Resident in a Smart Home , 2007 .

[11]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[12]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[13]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[14]  Jiye Liang,et al.  A new initialization method for categorical data clustering , 2009, Expert Syst. Appl..

[15]  Costas S. Tzafestas,et al.  Maximum Likelihood SLAM in Dynamic Environments , 2007 .

[16]  Jeff G. Schneider,et al.  Detecting anomalous records in categorical datasets , 2007, KDD '07.

[17]  Ira Assent,et al.  OutRank: ranking outliers in high dimensional data , 2008, 2008 IEEE 24th International Conference on Data Engineering Workshop.

[18]  Zengyou He,et al.  Discovering cluster-based local outliers , 2003, Pattern Recognit. Lett..