A New Experience in Persian Text Clustering using FarsNet Ontology

By organizing large text corpora, clustering plays a key role in easy navigation and browsing of massive amounts of text data, particularly in search engines. Conventional clustering techniques compare documents based on the surface similarity of words or extracted morphemes, which usually leads to clusters that are not semantically coherent. This paper addresses Farsi, also known as Persian, because the amount of electronic Farsi text is growing rapidly. The documents are enriched with semantic relationships (synonymy, hypernymy and hyponymy) extracted from the FarsNet lexical ontology. A word sense disambiguation (WSD) procedure is proposed to decrease uncertainty. After preprocessing, three clustering approaches, namely Bisecting K-means, LSI-based clustering and PLSI-based clustering, are applied to the pre-categorized Persian Hamshahri corpus. Experimental results show that clustering quality improves when the text data is enriched with the semantic relations, especially with the PLSI-based approach.
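The enrichment idea can be illustrated with a minimal sketch. It is not the paper's implementation: the lexicon `TOY_LEXICON`, the function names `enrich` and `cosine`, the English example words and the `weight` parameter are all assumptions made for illustration, and a real system would query FarsNet after word sense disambiguation. The sketch only shows how adding related terms can raise the similarity of two documents that share meaning but not surface words.

```python
import math
from collections import Counter

# Hypothetical toy lexicon standing in for FarsNet; entries are illustrative only.
TOY_LEXICON = {
    "car": {"synonyms": ["automobile"], "hypernyms": ["vehicle"], "hyponyms": ["taxi"]},
    "taxi": {"synonyms": ["cab"], "hypernyms": ["car"], "hyponyms": []},
}


def enrich(tokens, lexicon, relations=("synonyms", "hypernyms", "hyponyms"), weight=0.5):
    """Build a weighted bag of terms: each original token counts 1.0,
    each related term found in the lexicon counts `weight` (an assumed value)."""
    bag = Counter()
    for tok in tokens:
        bag[tok] += 1.0
        entry = lexicon.get(tok, {})
        for rel in relations:
            for related in entry.get(rel, []):
                bag[related] += weight
    return bag


def cosine(a, b):
    """Cosine similarity between two term-weight bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


doc1 = ["car", "road"]
doc2 = ["automobile", "road"]

# Surface similarity uses the raw tokens only.
plain_sim = cosine(Counter(doc1), Counter(doc2))
# Enriched similarity also counts the related terms added from the lexicon.
enriched_sim = cosine(enrich(doc1, TOY_LEXICON), enrich(doc2, TOY_LEXICON))
```

With these toy inputs, `enriched_sim` exceeds `plain_sim` because "car" contributes its synonym "automobile" to the first document. The enriched vectors could then be passed to any of the clustering methods named in the abstract.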
