Paper information - Discovering suffixes: A Case Study for Marathi Language

Suffix stripping is a pre-processing step required in a number of natural language processing applications. A stemmer is the tool that performs this step. This paper presents and evaluates two Marathi stemmers: one rule-based and one unsupervised. The rule-based stemmer uses a set of manually extracted suffix-stripping rules, whereas the unsupervised stemmer learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of the two stemmers is compared on a test dataset of 1500 manually stemmed words.
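The two approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix list `ILLUSTRATIVE_SUFFIXES`, the function names, and the thresholds (`min_stem_len`, `max_suffix_len`, `min_count`) are all hypothetical, and the frequency count is only a simple stand-in for whatever suffix-learning procedure the paper actually uses.

```python
from collections import Counter

# Hypothetical, illustrative suffix list; NOT the rule set from the paper.
ILLUSTRATIVE_SUFFIXES = ["ांना", "ाला", "ला", "ने"]


def strip_suffix(word, suffixes, min_stem_len=2):
    """Rule-based style: remove the longest matching suffix.

    The word is returned unchanged if no suffix matches or if stripping
    would leave fewer than min_stem_len characters.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def candidate_suffixes(words, max_suffix_len=4, min_stem_len=2, min_count=2):
    """Unsupervised style (assumed heuristic): count word endings over the
    distinct words of a vocabulary and keep endings seen at least
    min_count times."""
    counts = Counter()
    for word in set(words):
        for k in range(1, max_suffix_len + 1):
            if len(word) - k >= min_stem_len:
                counts[word[-k:]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}
```

Suffixes returned by `candidate_suffixes` could be passed to `strip_suffix` in place of a hand-written list, which mirrors the contrast the abstract draws between manually extracted rules and automatically learned suffixes.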

Tanveer J. Siddiqui | Mudassar M. Majgaonker
