An Unsupervised Approach to Develop Stemmer

This paper presents an unsupervised approach to developing a stemmer, with Urdu and Marathi as case studies. In the last few years, a wide range of information in Indian regional languages has become available on the web as e-data. Access to these data repositories remains low, however, because few efficient search engines or retrieval systems support these languages. Automatic information processing and retrieval has therefore become an urgent requirement. The system is trained on datasets taken from CRULP [22] and a Marathi corpus [23]. Two approaches are proposed for generating suffix rules: frequency-based stripping and length-based stripping. The evaluation was carried out on 1200 words extracted from the Emille corpus. The experimental results show that for Urdu, the frequency-based suffix generation approach gives a maximum accuracy of 85.36%, whereas the length-based suffix stripping algorithm gives a maximum accuracy of 79.76%. For Marathi, the system gives 63.5% accuracy with frequency-based stripping and achieves a maximum accuracy of 82.5% with the length-based suffix stripping algorithm.
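The abstract names the two rule-generation strategies but does not spell out their steps, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that candidate suffixes are collected from word endings in an unannotated training list, that rare candidates are discarded, and that the two strategies differ only in the order in which the surviving suffixes are tried: most frequent first, or longest first. The function names, the thresholds (`max_len`, `min_stem`, `min_freq`), and the English toy words (used purely for readability) are all hypothetical.

```python
from collections import Counter

# Hypothetical toy training list; English is used only for readability.
TOY_WORDS = [
    "walking", "talking", "jumping",
    "walked", "talked", "jumped",
    "walks", "talks", "jumps",
]


def candidate_suffixes(words, max_len=3, min_stem=2):
    """Count every word ending of up to max_len characters.

    Each distinct word contributes once per ending, and an ending is
    skipped if removing it would leave fewer than min_stem characters.
    """
    counts = Counter()
    for word in set(words):
        for k in range(1, max_len + 1):
            if len(word) - k >= min_stem:
                counts[word[-k:]] += 1
    return counts


def build_rules(words, order="frequency", min_freq=2):
    """Return an ordered suffix list (assumed rule format).

    order="frequency": most frequent suffix first.
    order="length":    longest suffix first.
    Ties are broken by the other criterion, then alphabetically.
    """
    counts = candidate_suffixes(words)
    kept = [s for s, c in counts.items() if c >= min_freq]
    if order == "frequency":
        kept.sort(key=lambda s: (-counts[s], -len(s), s))
    elif order == "length":
        kept.sort(key=lambda s: (-len(s), -counts[s], s))
    else:
        raise ValueError("order must be 'frequency' or 'length'")
    return kept


def stem(word, rules, min_stem=2):
    """Strip the first suffix in rules that matches; else return word."""
    for suffix in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


freq_rules = build_rules(TOY_WORDS, order="frequency")
len_rules = build_rules(TOY_WORDS, order="length")
```

Under these assumptions the two orderings can disagree on the same word. On the toy list, `stem("walked", freq_rules)` strips the frequent suffix `ed`, while `stem("walked", len_rules)` strips the longer but rarer `ked`. This is one way the reported accuracies could differ between strategies and between languages.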
