Distributed document clustering analysis based on a hybrid method

Clustering has become an increasingly challenging task because of the ever-growing amount of data in scientific research and commercial applications. High-quality, fast document clustering algorithms are in great demand to deal with large volumes of data. Moving such growing amounts of data to a central site for clustering imposes heavy computational requirements. The proposed algorithm uses Particle Swarm Optimization (PSO) to obtain optimal centroids for K-Means clustering. PSO's global search ability provides these centroids, which aid in generating more compact clusters with improved accuracy. The proposed methodology uses the Hadoop and MapReduce framework, which provides distributed storage and analysis to support data-intensive distributed applications. Experiments performed on the Reuters and RCV1 document datasets show an improvement in accuracy with reduced execution time.
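
The hybrid idea of seeding K-Means with centroids found by PSO can be illustrated with a minimal single-machine sketch. It is not the distributed Hadoop/MapReduce implementation described here, and several details are assumptions made only for illustration: the fitness function (sum of squared distances to the nearest centroid), the PSO coefficients `w`, `c1`, `c2`, the swarm size, and all function names (`pso_centroids`, `kmeans`, `sse`, `assign`) are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]

def sse(points, centroids):
    """Assumed fitness: sum of squared distances to the nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

def pso_centroids(points, k, n_particles=10, iters=30,
                  w=0.72, c1=1.49, c2=1.49, seed=0):
    """Global search: each particle encodes one candidate set of k centroids."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Initialise every particle from k randomly chosen data points.
    pos = [[list(p) for p in rng.sample(points, k)] for _ in range(n_particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(n_particles)]
    pbest = [[c[:] for c in particle] for particle in pos]
    pbest_fit = [sse(points, particle) for particle in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]
    for _ in range(iters):
        for i in range(n_particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Standard velocity update: inertia + personal + global pull.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            fit = sse(points, pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], fit
    return gbest

def kmeans(points, centroids, iters=20):
    """Local refinement: plain Lloyd iterations starting from the given centroids."""
    centroids = [c[:] for c in centroids]
    for _ in range(iters):
        labels = assign(points, centroids)
        new = []
        for j, c in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centroid.
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else c)
        if new == centroids:
            break
        centroids = new
    return centroids, assign(points, centroids)

# Toy usage: two small groups of 2-D points standing in for document vectors.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
          [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]]
seeds = pso_centroids(points, k=2)
centroids, labels = kmeans(points, seeds)
```

In a MapReduce setting, the assignment of points to their nearest centroid would naturally fall to the map phase and the recomputation of centroids to the reduce phase; the sketch keeps both steps in one process for readability.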