Document Clustering – A Feasible Demonstration with K-means Algorithm

Organizing documents manually is expensive in terms of time and effort, and reading through a large number of documents to interpret them by hand is equally challenging. Automated methods are therefore needed, and clustering is one of them. It is a major tool in many business and data science applications. Document clustering partitions documents into groups called clusters, where the documents in each cluster share common properties according to a closeness or similarity measure. Robust document clustering plays an essential role in helping users explore, summarize, and organize data. This paper clusters textual documents using the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. It proposes methods for selecting the initial centroids in the k-means clustering algorithm, which greatly reduce the effort required by cutting the number of iterations (usually to one) while preserving the accuracy of the resulting clusters. The proposed methods show promising results for small document sets.
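
As an illustration of the pipeline described above, the following minimal sketch builds TF-IDF vectors and runs k-means from caller-supplied initial centroids. It is not the paper's method: the four toy documents, the function names (`tfidf_vectors`, `kmeans`), the whitespace tokenizer, the L2 normalization, and the hand-picked seed documents are all assumptions made for this example, and the abstract does not specify how the proposed centroid selection works.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (list of floats) per document."""
    # Assumption: simple lowercase whitespace tokenization, no stop-word removal.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        vec = [(counts[t] / length) * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, max_iter=100):
    """Plain k-means; returns (labels, number of assignment passes performed)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            return labels, iteration
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, max_iter


# Toy corpus invented for this sketch (not data from the paper).
docs = [
    "cat dog pet animal",
    "dog pet animal vet",
    "stock market trade price",
    "market price trade bank",
]
vectors = tfidf_vectors(docs)
# Hypothetical seeding: one hand-picked document per expected cluster.
initial = [vectors[0], vectors[2]]
labels, iterations = kmeans(vectors, initial)
print(labels, iterations)
```

With well-separated seeds such as these, the first assignment pass already yields the final grouping and the second pass only confirms that nothing changed, which illustrates why the choice of initial centroids affects the iteration count.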
