Stemming Strategies for European Languages

In this paper, we describe and evaluate different general stemming approaches for the French, Portuguese (Brazilian), German and Hungarian languages. Based on the CLEF test-collections, we demonstrate that light stemming approaches are quite effective for the French, Portuguese and Hungarian languages, and perform reasonably well for the German language. Variations in mean average precision among the different stemmers are also evaluated and are sometimes found to be statistically significant.

[1]  Carol Peters,et al.  Comparative Evaluation of Multilingual Information Access Systems , 2003, Lecture Notes in Computer Science.

[2]  Jacques Savoy Report on CLEF-2003 Monolingual Tracks: Fusion of Probabilistic Models for Effective Monolingual Retrieval , 2003, CLEF.

[3]  Martti Juhola,et al.  Stemming and lemmatization in the clustering of finnish text documents , 2004, CIKM '04.

[4]  H. Markov,et al.  An algorithm to , 1997 .

[5]  Giorgio Maria Di Nunzio,et al.  The University of Padova at CLEF 2003: Experiments to Evaluate Probabilistic Models for Automatic Stemmer Generation and Query Word Translation , 2003, CLEF.

[6]  Martin Braschler,et al.  How Effective is Stemming and Decompounding for German Text Retrieval? , 2004, Information Retrieval.

[7]  James Mayfield,et al.  Character N-Gram Tokenization for European Language Text Retrieval , 2004, Information Retrieval.

[8]  Jacques Savoy,et al.  Stemming of French Words Based on Grammatical Categories , 1993, J. Am. Soc. Inf. Sci..

[9]  Stephen Tomlinson,et al.  Lexical and Algorithmic Stemming Compared for 9 European Languages with Hummingbird SearchServerTM at CLEF 2003 , 2003, CLEF.

[10]  Michael Kluck The GIRT Data in the Evaluation of CLIR Systems - from 1997 Until 2003 , 2003, CLEF.

[11]  Stephen E. Robertson,et al.  Experimentation as a way of life: Okapi at TREC , 2000, Inf. Process. Manag..

[12]  W. Bruce Croft,et al.  Corpus-based stemming using cooccurrence of word variants , 1998, TOIS.

[13]  Richard Sproat,et al.  Morphology and computation , 1992 .

[14]  Donna K. Harman,et al.  How effective is suffixing? , 1991, J. Am. Soc. Inf. Sci..

[15]  Julie Beth Lovins,et al.  Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[16]  C. J. van Rijsbergen,et al.  Probabilistic models of information retrieval based on measuring the divergence from randomness , 2002, TOIS.

[17]  David A. Hull Stemming Algorithms: A Case Study for Detailed Evaluation , 1996, J. Am. Soc. Inf. Sci..

[18]  Amit Singhal,et al.  AT&T at TREC-7 , 1998, TREC.

[19]  Jacques Savoy,et al.  Statistical inference in retrieval effectiveness evaluation , 1997, Inf. Process. Manag..