Proximity-based k-partitions clustering with ranking for document categorization and analysis

As one of the most fundamental yet important methods of data clustering, center-based partitioning approach clusters the dataset into k subsets, each of which is represented by a centroid or medoid. In this paper, we propose a new medoid-based k-partitions approach called Clustering Around Weighted Prototypes (CAWP), which works with a similarity matrix. In CAWP, each cluster is characterized by multiple objects with different representative weights. With this new cluster representation scheme, CAWP aims to simultaneously produce clusters of improved quality and a set of ranked representative objects for each cluster. An efficient algorithm is derived to alternatingly update the clusters and the representative weights of objects with respect to each cluster. An annealing-like optimization procedure is incorporated to alleviate the local optimum problem for better clustering results and at the same time to make the algorithm less sensitive to parameter setting. Experimental results on benchmark document datasets show that, CAWP achieves favorable effectiveness and efficiency in clustering, and also provides useful information for cluster-specified analysis.

[1]  Jitendra Malik,et al.  Normalized Cuts and Image Segmentation , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[2]  Piotr Indyk,et al.  Scalable Techniques for Clustering the Web , 2000, WebDB.

[3]  Fakhri Karray,et al.  A concept-based model for enhancing text categorization , 2007, KDD '07.

[4]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[5]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[6]  Raveendran Paramesran,et al.  A hybrid approach for data clustering based on modified cohort intelligence and K-means , 2014, Expert Syst. Appl..

[7]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[8]  Oren Etzioni,et al.  Web document clustering: a feasibility demonstration , 1998, SIGIR '98.

[9]  Peter H. A. Sneath,et al.  Numerical Taxonomy: The Principles and Practice of Numerical Classification , 1973 .

[10]  Rajesh N. Davé,et al.  Robust fuzzy clustering of relational data , 2002, IEEE Trans. Fuzzy Syst..

[11]  Ken Lang,et al.  NewsWeeder: Learning to Filter Netnews , 1995, ICML.

[12]  Andrew Skabar,et al.  Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm , 2013, IEEE Transactions on Knowledge and Data Engineering.

[13]  Vipin Kumar,et al.  Document Categorization and Query Generation on the World Wide Web Using WebACE , 1999, Artificial Intelligence Review.

[14]  George Karypis,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005, Data Mining and Knowledge Discovery.

[15]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[16]  Jian-Ping Mei,et al.  Document clustering around weighted-medoids , 2011, 2011 8th International Conference on Information, Communications & Signal Processing.

[17]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[18]  M. Tahar Kechadi,et al.  Cufres: clustering using fuzzy representative eventsselection for the fault recognition problem intelecommunication networks , 2007, PIKM '07.

[19]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[20]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[21]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[22]  Shi Zhong,et al.  Efficient online spherical k-means clustering , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[23]  Shie-Jue Lee,et al.  A Similarity Measure for Text Classification and Clustering , 2014, IEEE Transactions on Knowledge and Data Engineering.

[24]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[25]  Michalis Vazirgiannis,et al.  A density-based cluster validity approach using multi-representatives , 2008, Pattern Recognit. Lett..

[26]  Inderjit S. Dhillon,et al.  Weighted Graph Cuts without Eigenvectors A Multilevel Approach , 2007, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Xiaojun Wan,et al.  A novel document similarity measure based on earth mover's distance , 2007, Inf. Sci..

[28]  Hong-Gee Kim,et al.  Exploiting noun phrases and semantic relationships for text document clustering , 2009, Inf. Sci..

[29]  Jian-Ping Mei,et al.  Fuzzy clustering with weighted medoids for relational data , 2010, Pattern Recognit..

[30]  Gerold Schneider,et al.  Learning Document Similarity Using Natural Language Processing , 2013 .

[31]  Shi Zhong,et al.  A Comparative Study of Generative Models for Document Clustering , 2003 .

[32]  Dirk Cattrysse,et al.  Pairwise-adaptive dissimilarity measure for document clustering , 2010, Inf. Sci..

[33]  Geoffrey C. Fox,et al.  A deterministic annealing approach to clustering , 1990, Pattern Recognit. Lett..

[34]  Sunghae Jun,et al.  Document clustering method using dimension reduction and support vector clustering to overcome sparseness , 2014, Expert Syst. Appl..

[35]  Xiaotie Deng,et al.  A new suffix tree similarity measure for document clustering , 2007, WWW '07.