A Hybrid Stemmer for the Affix Stacking Language: Marathi

Stemming is a term-conflation process that reduces the morphological variants of a term to a common stem. It plays a significant role in preprocessing for most natural language processing, text mining, and information retrieval applications. Stemmers have proven highly effective in information retrieval for many languages, such as English and Arabic. This paper focuses on the development of an automated stemmer for the Marathi language, for which we adopt a hybrid technique. The goal of this work is to overcome the limitations of the existing Marathi stemmers and to improve the accuracy of Marathi stemming. The proposed stemmer is tested on Marathi news articles, and the evaluation shows a significant improvement in accuracy over the existing rule-based stemmer. The proposed hybrid stemmer achieves an average accuracy of 84.82%.
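
The abstract does not say which techniques the hybrid stemmer combines. As an illustration only, the sketch below shows one common hybrid design: an exception lookup table consulted first, with iterative rule-based suffix stripping as a fallback so that stacked affixes are removed one layer at a time. The names `LOOKUP`, `SUFFIXES`, `rule_stem`, and `hybrid_stem`, the tiny word and suffix lists, and the minimum stem length are all assumptions made for this example and are not taken from the paper.

```python
# Minimal sketch of a lookup + suffix-stripping hybrid stemmer.
# All data below is a hypothetical toy sample, not the paper's resources.

# Exception table for forms that suffix rules would handle badly (assumed).
LOOKUP = {"मुले": "मूल"}

# Illustrative suffix list (assumed); a real stemmer would use a much larger one.
SUFFIXES = ["ातून", "ाला", "ात", "ला"]


def rule_stem(word, min_len=2, max_passes=5):
    """Strip known suffixes repeatedly so stacked affixes are removed."""
    # Try longer suffixes first so "ाला" wins over "ला".
    ordered = sorted(SUFFIXES, key=len, reverse=True)
    for _ in range(max_passes):
        for suffix in ordered:
            remainder = len(word) - len(suffix)
            if word.endswith(suffix) and remainder >= min_len:
                word = word[:remainder]
                break  # restart the pass on the shortened word
        else:
            return word  # no suffix matched: stop stripping
    return word


def hybrid_stem(word):
    """Use the lookup table when possible, otherwise fall back to rules."""
    if word in LOOKUP:
        return LOOKUP[word]
    return rule_stem(word)


if __name__ == "__main__":
    for w in ["घरात", "घराला", "घरातला", "मुले", "घर"]:
        print(w, "->", hybrid_stem(w))
```

The lookup step runs first because it is exact. The rule step trades some precision for coverage of words missing from the table, and the minimum stem length is a simple guard against over-stemming short words.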
