Paper information - Scaling k-medoid algorithm for clustering large categorical dataset and its performance analysis

Scalable data mining algorithms have become crucial for efficiently supporting KDD processes on large datasets. k-medoid is one of the partitioning algorithms used for clustering. We show that the basic k-medoid algorithm is very time-consuming on large datasets, and we present an advanced algorithm that performs much better than the known algorithm. In addition to presenting detailed experimental results for the advanced k-medoid algorithm, we conduct an experimental study with real-life datasets to demonstrate the effectiveness of our technique. We address the task of scaling up k-medoid-based algorithms through the use of memoization. Experimental results on several datasets, both synthetic and real, show that the proposed algorithm may reduce the number of distance calculations by a factor of more than a thousand compared with existing algorithms while producing clusters of comparable quality.
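To make the memoization idea concrete, here is a minimal sketch of caching pairwise distances during k-medoid cost evaluation. The abstract does not specify the distance measure or the cache layout, so everything here is an assumption for illustration: the simple matching (attribute mismatch count) distance, the `MemoizedDistance` class, the `total_cost` helper, and the toy records are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[str, ...]


class MemoizedDistance:
    """Caches pairwise distances between records, keyed by index pair.

    Assumption: simple matching distance; the paper's measure may differ.
    """

    def __init__(self, records: Sequence[Record]) -> None:
        self.records = records
        self.cache: Dict[Tuple[int, int], int] = {}
        self.computed = 0  # number of distances actually calculated

    def __call__(self, i: int, j: int) -> int:
        if i == j:
            return 0
        key = (i, j) if i < j else (j, i)  # distance is symmetric
        if key not in self.cache:
            a, b = self.records[i], self.records[j]
            # Simple matching: count attributes whose values differ.
            self.cache[key] = sum(x != y for x, y in zip(a, b))
            self.computed += 1
        return self.cache[key]


def total_cost(dist: MemoizedDistance, medoids: List[int]) -> int:
    """Sum of each record's distance to its nearest medoid."""
    n = len(dist.records)
    return sum(min(dist(i, m) for m in medoids) for i in range(n))


# Hypothetical toy dataset with three categorical attributes.
records: List[Record] = [
    ("red", "small", "round"),
    ("red", "small", "square"),
    ("blue", "large", "round"),
    ("blue", "large", "square"),
]
dist = MemoizedDistance(records)
cost_first = total_cost(dist, [0, 2])
calls_after_first = dist.computed
cost_second = total_cost(dist, [0, 2])  # served entirely from the cache
print(cost_first, cost_second, calls_after_first, dist.computed)
```

In this sketch the second cost evaluation triggers no new distance calculations, which is the kind of saving memoization offers when a k-medoid search re-evaluates overlapping medoid sets. The trade-off is memory: the cache can grow to one entry per record pair.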

Ritesh Joshi | Anil Patidar | Surendra Mishra
