Stamantic clustering: Combining statistical and semantic features for clustering of large text datasets

Abstract Document clustering is a heavily researched problem in text mining. Approaches based on statistical features and approaches based on semantic features have each been used extensively to solve it. However, techniques that combine the advantages of both types of features have not been studied as often. As textual data grows rapidly in size, there is a need for an approach that combines both types of features to give more accurate results within an acceptable amount of time. This paper proposes a document clustering technique that combines the effectiveness of statistical features (using TF-IDF) and semantic features (using lexical chains). It is designed to use fewer features while maintaining comparable, or even better, accuracy for the task of document clustering.
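The general idea of pairing the two feature types can be illustrated with a minimal sketch. It is not the method proposed in the paper, whose details the abstract does not give. It assumes a toy hand-written thesaurus (`CONCEPT_OF`, hypothetical) in place of WordNet, treats all terms mapped to the same concept as one "chain", and scores each chain by summing the TF-IDF weights of its terms. The function names `tf_idf` and `chain_features` are also illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Assumption: a toy thesaurus standing in for WordNet relations.
CONCEPT_OF = {
    "car": "vehicle", "truck": "vehicle", "automobile": "vehicle",
    "bank": "finance", "loan": "finance", "money": "finance",
}

def chain_features(doc_weights, concept_of):
    """Sum the TF-IDF weights of terms that fall into the same toy chain."""
    features = Counter()
    for term, weight in doc_weights.items():
        concept = concept_of.get(term)
        if concept is not None:  # terms outside every chain are dropped
            features[concept] += weight
    return dict(features)

docs = [
    "the car and the truck passed an automobile".split(),
    "the bank approved the loan and released money".split(),
]
features = [chain_features(w, CONCEPT_OF) for w in tf_idf(docs)]
print(features)
```

In this toy setting each document is reduced from several term features to a single chain feature, which reflects the abstract's stated goal of using fewer features. Any standard clustering algorithm could then be run on the reduced vectors.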
