Design and Development of a Dictionary Based Stemmer for Marathi Language

Stemming is a term conflation technique that reduces the morphological variants of a term to a single form called the "stem". It is a significant pre-processing step in many applications of natural language processing (NLP) and information retrieval (IR), such as machine translation, named entity recognition, and automated document processing. In this paper, we focus on the development of an automated stemmer for the Marathi language, for which we adopt the dictionary lookup technique. The stemmer is tested on Marathi news articles consisting of 4500 words. It achieved a maximum accuracy of 80.6% across nine different runs, and its over-stemming error rate is low. These satisfactory results encourage us to use the proposed stemmer for the information retrieval task.
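As a rough illustration of the dictionary lookup technique, the sketch below maps each inflected word form to its stem through a lookup table and leaves unknown words unchanged. The dictionary entries, the function names `stem` and `accuracy`, and the fallback behaviour for out-of-dictionary words are all assumptions made for illustration; they are not taken from the paper's implementation or data.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary (inflected form -> stem).
# These entries are illustrative only, not the paper's dictionary.
STEM_DICTIONARY: Dict[str, str] = {
    "घरात": "घर",        # "in the house" -> "house"
    "घरी": "घर",         # "at home"      -> "house"
    "शाळेत": "शाळा",      # "in school"    -> "school"
    "पुस्तके": "पुस्तक",    # "books"        -> "book"
}


def stem(word: str, dictionary: Dict[str, str] = STEM_DICTIONARY) -> str:
    """Return the stem stored for `word`, or the word itself if it is unknown.

    Returning the word unchanged is an assumed fallback; a real system
    might instead apply suffix-stripping rules to unknown words.
    """
    return dictionary.get(word, word)


def accuracy(pairs: List[Tuple[str, str]],
             dictionary: Dict[str, str] = STEM_DICTIONARY) -> float:
    """Percentage of (word, expected_stem) pairs that are stemmed correctly."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, expected in pairs
                  if stem(word, dictionary) == expected)
    return 100.0 * correct / len(pairs)
```

With this design the quality of the stemmer depends entirely on dictionary coverage: a word that is missing from the table is returned as-is, so it counts as an error whenever its expected stem differs from the surface form.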
