Text Clustering using a WordNet-based Knowledge-Base and the Lesk Algorithm

In this paper, we propose a text clustering method based on a well-known Word Sense Disambiguation (WSD) algorithm, the Lesk algorithm, so that textual data is grouped primarily by the context or meaning of its words rather than by surface word forms. The Lesk algorithm returns a sense identifier for each word used to cluster the text files by looking up the word's senses in a knowledge base derived from the English WordNet, in which each synset (synonym set) is enriched with additional informative fields; this enrichment increases the chances of contextual (gloss) overlap and thereby improves the accuracy of sense identification. The proposed scheme has been tested on a number of heterogeneous text document datasets. The clustering results and accuracies obtained with the proposed scheme have been compared against those obtained by running the K-means clustering algorithm on Vector Space Models built for the same datasets. Experimental results show that our algorithm performs much better than the Vector Space Model (VSM) and K-means based approach. The technique should therefore help users retrieve meaningful, contextually relevant information from a highly diversified collection of texts, which is a key challenge posed by the information overload problem.
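To make the sense-tagging step concrete, the sketch below shows a minimal version of it in Python using NLTK's implementation of the Lesk algorithm against the standard English WordNet. This is an illustrative approximation only: the paper's enriched knowledge base (with its additional synset fields) is not publicly assumed here, and the function name `sense_identifiers` is hypothetical.

```python
# A minimal sketch of Lesk-based sense tagging, assuming the plain English
# WordNet rather than the paper's enriched knowledge base.
import nltk
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

# One-time corpus setup:
# nltk.download('punkt'); nltk.download('wordnet')

def sense_identifiers(text):
    """Map each word in a document to a WordNet sense identifier via Lesk."""
    tokens = word_tokenize(text.lower())
    senses = []
    for word in tokens:
        # Gloss-overlap disambiguation: pick the synset whose definition
        # overlaps most with the surrounding context words.
        synset = lesk(tokens, word)
        if synset is not None:
            senses.append(synset.name())  # e.g. 'bank.n.01'
    return senses

print(sense_identifiers("The bank raised interest rates on savings accounts"))
```

Documents represented by such sense identifiers can then be clustered on meaning rather than on raw word forms. For contrast, a sketch of the VSM and K-means baseline the paper compares against is given below, using scikit-learn with illustrative parameters (the toy documents and cluster count are placeholders, not the paper's datasets or settings).

```python
# Baseline sketch: TF-IDF Vector Space Model clustered with K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "The bank raised interest rates on savings accounts.",
    "The river bank flooded after the heavy rains.",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```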