Paper information - Improving unsupervised stemming by using partial lemmatization coupled with data-based heuristics for Hindi

Stemming and lemmatization are two important natural language processing techniques, widely used in Information Retrieval (IR) for query processing and in Machine Translation (MT) for reducing data sparseness. Both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Most existing stemming and lemmatization work is based either on language-dependent rules, which require the supervision of a language expert, or on probabilistic approaches, which need a vast monolingual corpus; in both cases, stemming and lemmatization algorithms are developed independently of each other. In our work, we propose an unsupervised stemmer that is hybridized with partial lemmatization for Hindi. The proposed stemmer is unique in that it exploits a novel grouping criterion, aims to improve unsupervised stemming and, most importantly, avoids the over-stemming problem, which is a common phenomenon in stemming. The latter is tackled by introducing lemmas. We incorporate lemmatization based on data heuristics obtained from the corpus, without the use of word-class information. Applying this concept to unsupervised stemming yielded significant improvements in the desired results compared with other prevailing approaches of its genre.

Deepa Gupta | Rahul Kumar Yadav | Nidhi Sajan
