WordNet and Semantic similarity based approach for document clustering

With the continued growth of the internet, the number of text documents in electronic form is increasing rapidly. Document clustering, which organizes such large collections of documents into meaningful clusters, has therefore become an important technique. Traditional clustering methods cluster documents based on statistical features. Because these methods ignore semantic relationships between documents, the documents they group together are not necessarily conceptually similar to one another. This paper introduces a document clustering model that groups documents with similar concepts together. The proposed model initially identifies all the coreferences in each document in the collection. Polysemy and synonymy problems are tackled by selecting the appropriate sense of each word based on its context, using WordNet and semantic similarity. The proposed clustering model is implemented for the classic4 dataset, and the results show an improvement in efficiency.
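
To make the sense-selection step concrete, the following is a minimal sketch of context-based word sense disambiguation over a hypernym hierarchy. It is not the paper's implementation: the abstract does not say which similarity measure is used, so a simple path-based score is assumed here, and the tiny taxonomy (`HYPERNYMS`), the sense inventory (`SENSES`), and the function names are all hypothetical stand-ins for WordNet.

```python
# Hypothetical toy hierarchy (child -> parent), standing in for WordNet hypernyms.
HYPERNYMS = {
    "bank.financial": "institution",
    "lender": "institution",
    "institution": "organization",
    "organization": "entity",
    "bank.river": "slope",
    "shore": "slope",
    "slope": "landform",
    "landform": "entity",
}

# Hypothetical sense inventory: surface word -> candidate senses.
SENSES = {
    "bank": ["bank.financial", "bank.river"],
    "lender": ["lender"],
    "shore": ["shore"],
}


def ancestors(node):
    """Map the node and each of its ancestors to its distance from the node."""
    distances = {}
    depth = 0
    while node is not None:
        distances[node] = depth
        node = HYPERNYMS.get(node)
        depth += 1
    return distances


def path_similarity(a, b):
    """Assumed measure: 1 / (1 + shortest path through a common ancestor)."""
    da, db = ancestors(a), ancestors(b)
    common = set(da) & set(db)
    if not common:
        return 0.0
    shortest = min(da[n] + db[n] for n in common)
    return 1.0 / (1.0 + shortest)


def disambiguate(word, context):
    """Pick the sense of `word` that is most similar to the context words."""
    def score(sense):
        total = 0.0
        for other in context:
            # Context words missing from the toy inventory contribute nothing.
            candidates = SENSES.get(other, [])
            if candidates:
                total += max(path_similarity(sense, c) for c in candidates)
        return total

    return max(SENSES[word], key=score)


print(disambiguate("bank", ["lender"]))
print(disambiguate("bank", ["shore"]))
```

In this toy setup, "bank" resolves to the financial sense next to "lender" and to the river sense next to "shore". Mapping synonyms to the same sense identifier in the same way would let documents that use different words for one concept share features before clustering.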
