Document Clustering

Clustering is an unsupervised learning technique that groups a set of objects into subsets, or clusters. The goal is to create clusters that are internally coherent but substantially different from each other. In plain words, objects in the same cluster should be as similar as possible, whereas objects in different clusters should be as dissimilar as possible. Automatic document clustering plays an important role in many fields, such as information retrieval and data mining. The aim of this thesis is to improve the efficiency and accuracy of document clustering. We discuss two clustering algorithms and the settings in which they perform better than the standard clustering algorithms. The first approach improves on the graph partitioning techniques used for document clustering: we preprocess the graph using a heuristic and then apply standard graph partitioning algorithms, which substantially improves the quality of the clusters. The second approach is entirely different: the words are clustered first, and the resulting word clusters are then used to cluster the documents. This reduces the noise in the data and thus improves the quality of the clusters. Both approaches have parameters that can be tuned to the dataset in order to improve quality and efficiency.
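The general idea behind the second approach (cluster the words first, then use the word clusters to group the documents) can be illustrated with a minimal sketch. It is not the algorithm developed in this thesis: the greedy cosine-similarity word clustering, the majority-vote document assignment, the `threshold` parameter, the function names, and the toy corpus are all assumptions chosen only to keep the example short.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(docs, threshold=0.5):
    """Greedy single-pass word clustering (an illustrative stand-in).

    Each word is described by its count in every document. A word joins the
    first cluster whose seed word is similar enough; otherwise it starts a
    new cluster and becomes that cluster's seed.
    """
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set(w for c in counts for w in c))
    vectors = {w: [c[w] for c in counts] for w in vocab}
    seeds = []  # seed word of each cluster, indexed by cluster id
    word_to_cluster = {}
    for w in vocab:
        for idx, seed in enumerate(seeds):
            if cosine(vectors[w], vectors[seed]) >= threshold:
                word_to_cluster[w] = idx
                break
        else:
            # No existing cluster is similar enough: open a new one.
            word_to_cluster[w] = len(seeds)
            seeds.append(w)
    return word_to_cluster


def cluster_documents(docs, word_to_cluster):
    """Label each document with the word cluster covering most of its tokens."""
    labels = []
    for doc in docs:
        votes = Counter(word_to_cluster[w] for w in doc.split())
        labels.append(votes.most_common(1)[0][0])
    return labels


# Hypothetical toy corpus with two obvious topics.
docs = [
    "graph partition graph edge",
    "edge partition graph",
    "word cluster noise word",
    "noise cluster word",
]
word_to_cluster = cluster_words(docs)
labels = cluster_documents(docs, word_to_cluster)
print(labels)
```

In this sketch each document is reduced to a handful of word-cluster votes instead of one dimension per word, which is one way to see why grouping words first can suppress noise from individual terms. The `threshold` value plays the role of a dataset-dependent parameter.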