Improving unsupervised stemming by using partial lemmatization coupled with data-based heuristics for Hindi

Stemming and lemmatization are two important natural language processing techniques widely used in Information Retrieval (IR) for query processing and in Machine Translation (MT) for reducing data sparseness. Both reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Most existing work on stemming and lemmatization is based either on language-dependent rules, which require the supervision of a language expert, or on a probabilistic approach, which needs a large monolingual corpus; in both cases the stemming and lemmatization algorithms are developed independently of each other. In this work, we propose an unsupervised stemmer for Hindi that is hybridized with partial lemmatization. The proposed stemmer is unique in that it exploits a novel grouping criterion, aims to improve unsupervised stemming, and, most importantly, avoids the over-stemming problem, which is a common phenomenon in stemming. The latter is tackled by the introduction of lemmas. We incorporate lemmatization based on data heuristics obtained from the corpus, without the use of word class information. Applying this concept to unsupervised stemming yielded significant improvements in results when compared to other prevailing approaches of its kind.