GRAS: An effective and efficient stemming algorithm for information retrieval

This article proposes a novel graph-based, language-independent stemming algorithm suitable for information retrieval. The main features of the algorithm are retrieval effectiveness, generality, and computational efficiency. We test our approach on seven languages of varying morphological complexity, using collections from the TREC, CLEF, and FIRE evaluation platforms. We demonstrate significant performance improvements over plain word-based retrieval, three other language-independent morphological normalizers, and rule-based stemmers.
