Designing a Bangla Stemmer using rule based approach

Stemming is a preprocessing task in natural language processing that normalizes inflected words to a common root representing the same concept. It is a form of text normalization with many applications. Many stemming techniques exist for different languages, but very little work addresses Bangla, so stemming Bangla words remains an unsolved problem. Bangla words can be inflected in many different ways, and a stemmer must handle each of them. In this paper, we present a rule-based algorithm to stem Bangla words. We developed rules for detecting inflections, including verb inflection (বিভক্তি), number inflection (বচন), and others. Using these rules, we developed a system that finds the root word of Bangla words, and we observed good performance. Sufficient examples are provided to explain the proposed system.
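The general idea of rule-based suffix stripping can be sketched as follows. This is a minimal illustration only, not the rule set developed in the paper: the suffix lists, the `SUFFIXES` table, the `stem` function, and the `min_stem_len` threshold are all assumptions chosen for the example, covering a handful of common number and verb endings.

```python
# Hypothetical, deliberately tiny suffix table (NOT the paper's rules).
SUFFIXES = {
    "number": ["গুলো", "গুলি", "দের", "রা"],          # plural markers
    "verb": ["ছিলাম", "ছিল", "ছি", "ছে", "লাম", "বে"],  # a few verb endings
}


def stem(word: str, min_stem_len: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least min_stem_len code points."""
    # Try longer suffixes first so that "ছিলাম" is preferred over "লাম".
    candidates = sorted(
        (s for group in SUFFIXES.values() for s in group),
        key=len,
        reverse=True,
    )
    for suffix in candidates:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


if __name__ == "__main__":
    for w in ["ছেলেরা", "বইগুলো", "ছেলেদের", "করছি", "করবে"]:
        print(w, "->", stem(w))
```

The longest-match ordering and the minimum stem length are common guards against over-stripping. A practical stemmer would likely need many more rules, plus some check against a root lexicon, since a short suffix such as "রা" can also be part of an uninflected word.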
