A Hybrid Approach for Web Document Clustering Using K-means and Artificial Bee Colony Algorithm

Data volumes keep growing over time, and storing that data in an organised fashion is a major challenge. Document clustering addresses this by grouping related documents together. This paper proposes a web document clustering algorithm, WDC-KABC, that combines K-means with the Artificial Bee Colony (ABC) clustering algorithm. ABC is employed as the global search optimizer and K-means refines the solutions it finds, which improves cluster quality. The performance of WDC-KABC is analysed on four datasets (webkb, wap, rec0 and 7sectors) and compared with existing algorithms: K-means, Particle Swarm Optimization, a hybrid of Particle Swarm Optimization and K-means, and Ant Colony Optimization. The experimental results of WDC-KABC are satisfactory in terms of precision, recall, F-measure, accuracy and error rate.
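The division of labour described in the abstract (ABC for global search, K-means for refinement) can be illustrated with a minimal sketch. This is not the paper's WDC-KABC implementation: the function names, the parameters `n_sources`, `limit` and `cycles`, the squared Euclidean cost, and the choice to refine only the current best food source once per cycle are all assumptions made for illustration. Documents are assumed to be already represented as numeric vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centroids):
    """Clustering cost: each point contributes its distance to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def kmeans_step(points, centroids):
    """One K-means (Lloyd) iteration; an empty cluster keeps its old centroid."""
    groups = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        groups[nearest].append(p)
    return [
        [sum(col) / len(g) for col in zip(*g)] if g else list(c)
        for g, c in zip(groups, centroids)
    ]


def abc_kmeans(points, k, n_sources=8, limit=5, cycles=30, seed=0):
    """Hypothetical ABC + K-means hybrid; each food source is a set of k centroids."""
    rng = random.Random(seed)
    dims = len(points[0])

    def new_source():
        return [list(p) for p in rng.sample(points, k)]

    sources = [new_source() for _ in range(n_sources)]
    costs = [sse(points, s) for s in sources]
    trials = [0] * n_sources

    def try_neighbour(i):
        # Perturb one coordinate of one centroid relative to a random partner source.
        j = rng.choice([x for x in range(n_sources) if x != i])
        c, d = rng.randrange(k), rng.randrange(dims)
        cand = [list(cen) for cen in sources[i]]
        cand[c][d] += rng.uniform(-1, 1) * (cand[c][d] - sources[j][c][d])
        cand_cost = sse(points, cand)
        if cand_cost < costs[i]:  # greedy selection
            sources[i], costs[i], trials[i] = cand, cand_cost, 0
        else:
            trials[i] += 1

    for _ in range(cycles):
        for i in range(n_sources):  # employed bees
            try_neighbour(i)
        weights = [1.0 / (1.0 + c) for c in costs]  # onlookers favour low-cost sources
        for i in rng.choices(range(n_sources), weights=weights, k=n_sources):
            try_neighbour(i)
        best = min(range(n_sources), key=costs.__getitem__)
        sources[best] = kmeans_step(points, sources[best])  # K-means refinement
        costs[best] = sse(points, sources[best])
        for i in range(n_sources):  # scouts replace exhausted sources, never the best
            if i != best and trials[i] > limit:
                sources[i], trials[i] = new_source(), 0
                costs[i] = sse(points, sources[i])

    best = min(range(n_sources), key=costs.__getitem__)
    return sources[best], costs[best]
```

Calling `abc_kmeans(vectors, k=2)` on a list of equal-length numeric lists returns the best centroid set found and its cost. Because the best source is never abandoned and both the greedy selection and the K-means step can only keep or lower a source's cost, the best cost does not get worse from one cycle to the next.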
