Harmony K-means algorithm for document clustering

Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.

[1]  Witold Pedrycz,et al.  Data Mining Methods for Knowledge Discovery , 1998, IEEE Trans. Neural Networks.

[2]  Andreas Hotho,et al.  Semantic Web Mining: State of the art and future directions , 2006, J. Web Semant..

[3]  M. Fesanghary,et al.  An improved harmony search algorithm for solving optimization problems , 2007, Appl. Math. Comput..

[4]  Günter Rudolph,et al.  Convergence analysis of canonical genetic algorithms , 1994, IEEE Trans. Neural Networks.

[5]  Joydeep Ghosh,et al.  Under Consideration for Publication in Knowledge and Information Systems Generative Model-based Document Clustering: a Comparative Study , 2003 .

[6]  Nizar Grira,et al.  Unsupervised and Semi-supervised Clustering : a Brief Survey ∗ , 2004 .

[7]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[8]  Brian Everitt,et al.  Cluster analysis , 1974 .

[9]  Philip S. Yu,et al.  On the merits of building categorization systems by supervised clustering , 1999, KDD '99.

[10]  Gerard Salton,et al.  Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[11]  Thomas E. Potok,et al.  Document clustering using particle swarm optimization , 2005, Proceedings 2005 IEEE Swarm Intelligence Symposium, 2005. SIS 2005..

[12]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[13]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[14]  Andries Petrus Engelbrecht,et al.  Data clustering using particle swarm optimization , 2003, The 2003 Congress on Evolutionary Computation, 2003. CEC '03..

[15]  C. Coello,et al.  CONSTRAINT-HANDLING USING AN EVOLUTIONARY MULTIOBJECTIVE OPTIMIZATION TECHNIQUE , 2000 .

[16]  Joydeep Ghosh,et al.  A Unified Framework for Model-based Clustering , 2003, J. Mach. Learn. Res..

[17]  Vipin Kumar,et al.  Partitioning-based clustering for Web document categorization , 1999, Decis. Support Syst..

[18]  Andries P. Engelbrecht,et al.  Image Classification using Particle Swarm Optimization , 2002, SEAL.

[19]  Zong Woo Geem,et al.  Harmony Search Optimization: Application to Pipe Network Design , 2002 .

[20]  B. Everitt,et al.  Cluster Analysis (2nd ed). , 1982 .

[21]  Gareth Jones,et al.  Non-hierarchic document clustering using a genetic algorithm , 1995, Information Research.

[22]  FayyadUsama,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005 .

[23]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[24]  George Karypis,et al.  A Comparison of Document Clustering Techniques , 2000 .

[25]  Richard C. Dubes,et al.  Experiments in projection and clustering by simulated annealing , 1989, Pattern Recognit..

[26]  George Karypis,et al.  Hierarchical Clustering Algorithms for Document Datasets , 2005, Data Mining and Knowledge Discovery.

[27]  K. Lee,et al.  A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice , 2005 .

[28]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[29]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[30]  Shokri Z. Selim,et al.  K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality , 1984, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  George Karypis,et al.  Empirical and Theoretical Comparisons of Selected Criterion Functions for Document Clustering , 2004, Machine Learning.

[32]  Chinatsu Aone,et al.  Fast and effective text mining using linear-time document clustering , 1999, KDD '99.

[33]  Vijay V. Raghavan,et al.  A clustering strategy based on a formalism of the reproductive process in natural systems , 1979, SIGIR '79.

[34]  Vijay V. Raghavan,et al.  A clustering strategy based on a formalism of the reproductive process in natural systems , 1979, SIGIR 1979.

[35]  Michael R. Anderberg,et al.  Cluster Analysis for Applications , 1973 .

[36]  Nicolas Monmarché,et al.  AntClust: Ant Clustering and Web Usage Mining , 2003, GECCO.

[37]  Shi Zhong,et al.  Semi-supervised model-based document clustering: A comparative study , 2006, Machine Learning.

[38]  Zong Woo Geem,et al.  Harmony Search for Generalized Orienteering Problem: Best Touring in China , 2005, ICNC.