A semantic approach for text clustering using WordNet and lexical chains

A modified WordNet-based similarity measure is used for word sense disambiguation. Lexical chains serve as a text representation that covers the theme of the texts. The extracted core semantics are sufficient to reduce the dimensionality of the feature set. The proposed scheme is able to correctly estimate the true number of clusters. The topic labels are good indicators for recognizing and understanding the clusters.

Traditional clustering algorithms do not consider the semantic relationships among words and therefore cannot accurately represent the meaning of documents. To overcome this problem, semantic information from ontologies such as WordNet has been widely used to improve the quality of text clustering. However, several challenges remain, such as synonymy and polysemy, high dimensionality, extracting core semantics from texts, and assigning appropriate descriptions to the generated clusters. In this paper, we report our attempt at integrating WordNet with lexical chains to alleviate these problems. The proposed approach exploits the hierarchical structure and relations of the ontology to provide a more accurate assessment of the similarity between terms for word sense disambiguation. Furthermore, we introduce lexical chains to extract sets of semantically related words from texts, which can represent the semantic content of the texts. Although lexical chains have been used extensively in text summarization, their potential impact on the text clustering problem has not been fully investigated. Our integrated approach can identify the theme of documents based on the disambiguated core features it extracts, and at the same time reduce the dimensionality of the feature space. Experimental results on Reuters-21578 show that the proposed framework improves clustering performance significantly compared with several classical methods.
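To make the idea of similarity-driven lexical chains concrete, the following is a minimal sketch, not the method of the paper. It assumes a tiny hand-made hypernym hierarchy (`PARENT`) in place of WordNet, uses the standard Wu-Palmer similarity instead of the modified measure mentioned above (which is not specified in this abstract), and builds chains greedily. The names `PARENT`, `wu_palmer`, `build_chains`, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical toy hypernym hierarchy (child -> parent); NOT real WordNet data.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "organization": "entity",
    "bank": "organization",
    "company": "organization",
    "object": "entity",
    "vehicle": "object",
    "car": "vehicle",
    "truck": "vehicle",
}


def path_to_root(node: str) -> List[str]:
    """Return the node followed by all of its ancestors, ending at the root."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def depth(node: str) -> int:
    """Depth in the hierarchy, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a: str, b: str) -> float:
    """Standard Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # The first ancestor of b that is also an ancestor of a is the deepest shared one.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def build_chains(words: List[str], threshold: float = 0.6) -> List[List[str]]:
    """Greedily attach each known word to the most similar existing chain.

    A word joins a chain if its best similarity to any chain member reaches
    the (assumed) threshold; otherwise it starts a new chain. Words missing
    from the toy hierarchy are skipped.
    """
    chains: List[List[str]] = []
    for word in words:
        if word not in PARENT:
            continue
        best_chain: Optional[List[str]] = None
        best_score = threshold
        for chain in chains:
            score = max(wu_palmer(word, member) for member in chain)
            if score >= best_score:
                best_chain, best_score = chain, score
        if best_chain is None:
            chains.append([word])
        else:
            best_chain.append(word)
    return chains


if __name__ == "__main__":
    print(build_chains(["bank", "car", "company", "truck", "loan"]))
```

In this toy setting each resulting chain groups words under a shared theme, so a document could be represented by its strongest chains rather than by every term, which is the kind of dimensionality reduction the abstract describes. A real implementation would operate on disambiguated WordNet senses rather than on raw word strings.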
