Improving Persian Information Retrieval Systems Using Stemming and Part of Speech Tagging

With the emergence of vast information resources, it is necessary to develop methods that retrieve the information most relevant to a user's needs. Such retrieval methods may exploit natural language constructs to achieve higher precision and recall. In this study, we use the part-of-speech properties of terms as an extra source of information about document and query terms, and we evaluate the impact of this information on the performance of Persian retrieval algorithms. As a complement to this work, we also examine the effect of stemming. Our findings indicate that part-of-speech tags on their own may have only a small influence on the effectiveness of the retrieved results. However, when this information is combined with stemming, it improves the accuracy of the results considerably.
