Semantic based Document Clustering: A Detailed Review

Document clustering, a traditional data mining technique, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, producing a set of clusters that exhibit high intra-cluster similarity and low inter-cluster similarity. The importance of document clustering stems from the massive volumes of textual documents being created. Although numerous document clustering methods have been studied extensively in recent years, several challenges remain in improving clustering quality. In particular, most current document clustering algorithms do not consider semantic relationships, which leads to unsatisfactory clustering results. Over the last three to four years, efforts have been made to apply semantics to document clustering. Here, an exhaustive and detailed review of more than thirty semantics-driven document clustering methods is presented. After an introduction to document clustering and its basic requirements for improvement, traditional algorithms are overviewed, and semantic similarity measures are explained. The article then discusses algorithms that interpret documents semantically for clustering. The semantic approach applied, datasets used, evaluation parameters applied, limitations, and future work of all these approaches are presented in tabular format for easy and quick interpretation.
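The clustering objective stated above (high intra-cluster similarity, low inter-cluster similarity) can be made concrete with a minimal sketch. It assumes plain term-frequency vectors and cosine similarity, which is one common traditional choice and not necessarily the measure used by any method in the review; the toy documents, labels, and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def tf_vector(text):
    """Bag-of-words term-frequency vector (whitespace tokens, lowercased)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_similarities(docs, labels):
    """Return (mean intra-cluster, mean inter-cluster) pairwise similarity."""
    vecs = [tf_vector(d) for d in docs]
    intra, inter = [], []
    for i, j in combinations(range(len(docs)), 2):
        sim = cosine(vecs[i], vecs[j])
        (intra if labels[i] == labels[j] else inter).append(sim)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(intra), mean(inter)

# Hypothetical toy corpus with an assumed cluster assignment.
docs = [
    "stock market prices fall",
    "market prices rise on stock news",
    "football team wins match",
    "team loses football match",
]
labels = [0, 0, 1, 1]
intra, inter = cluster_similarities(docs, labels)
print(f"intra={intra:.3f} inter={inter:.3f}")

# Pure term matching cannot relate synonyms: these share no tokens.
print(cosine(tf_vector("car"), tf_vector("automobile")))
```

A good clustering of this corpus yields a mean intra-cluster similarity well above the mean inter-cluster similarity. The last line illustrates the limitation the abstract points to: without semantic knowledge, synonymous terms such as "car" and "automobile" contribute nothing to similarity.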
