Paper information - Design of a Rule-based Stemmer for Natural Language Text in Bengali

Design of a Rule-based Stemmer for Natural Language Text in Bengali

This paper presents a rule-based approach for finding the stems of tokens in Bengali text; Bengali is a resource-poor language. It first introduces the orthographic syllable, the basic orthographic unit of Bengali. It then discusses the morphological structure of tokens for different parts of speech, formalizes the inflection rule constructs, and formulates a quantitative ranking measure for the candidate stems of a token. These concepts are applied in the design and implementation of an extensible stemmer architecture for Bengali text. The reported accuracy of the system is approximately 89% or higher.
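To make the idea of candidate stems and ranking concrete, here is a minimal Python sketch. It is not the paper's rule set or ranking measure: the suffix list, the minimum stem length, and the "longest stripped suffix wins" ranking are all assumptions chosen for illustration, and the sketch works on Unicode code points rather than on orthographic syllables.

```python
import unicodedata

# Hypothetical suffix list for illustration only; not the paper's inflection rules.
SUFFIXES = ["দের", "ের", "কে", "রা", "র"]
MIN_STEM_LEN = 2  # assumed lower bound on stem length, counted in code points


def candidate_stems(token):
    """Return (stem, suffix) pairs for every listed suffix that matches the token."""
    token = unicodedata.normalize("NFC", token)
    candidates = [(token, "")]  # the unstripped token is always a candidate
    for suffix in SUFFIXES:
        suffix = unicodedata.normalize("NFC", suffix)
        base = token[: len(token) - len(suffix)]
        if token.endswith(suffix) and len(base) >= MIN_STEM_LEN:
            candidates.append((base, suffix))
    return candidates


def stem(token):
    """Pick the candidate whose stripped suffix is longest (an assumed ranking)."""
    best_stem, _ = max(candidate_stems(token), key=lambda pair: len(pair[1]))
    return best_stem
```

For the token "ছেলেদের", three suffixes in the list match ("দের", "ের", and "র"), so several candidate stems compete; the assumed ranking keeps the one produced by the longest suffix. A code-point view like this can cut a token in the middle of a syllable, which is one reason a syllable-aware unit is a natural basis for a real stemmer.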

Sivaji Bandyopadhyay | Sandipan Sarkar

[1] Vladimir I. Levenshtein, et al. Binary codes capable of correcting deletions, insertions, and reversals, 1965.

[2] Thorsten Brants, et al. TnT – A Statistical Part-of-Speech Tagger, 2000, ANLP.

[3] Utpal Garain, et al. An approach for stemming in symbolically compressed Indian language imaged documents, 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[4] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[5] Ananthakrishnan Ramanathan, et al. A Lightweight Stemmer for Hindi, 2003.

[6] Prasenjit Majumder, et al. YASS: Yet another suffix stripper, 2007, TOIS.

[7] D. Redmond-Pyle, et al. A Standard for Architecture Description, 1999, IBM Syst. J.