Stemming and Decompounding for German Text Retrieval

The stemming problem, i.e. finding a common stem for different forms of a term, has been studied extensively for English, but considerably less is known for other languages. It has previously been claimed that stemming is essential for highly declensional languages. We report on our stemming experiments for German, where an additional issue is the handling of compounds, which are formed by concatenating several words. Studies on stemming for any language rarely cover more than one or two different approaches. This paper makes a contribution that extends beyond its focus on German by investigating a complete spectrum of approaches, ranging from language-independent to elaborate linguistic methods. The main findings are that stemming is beneficial even when a simple approach is used, and that carefully designed decompounding, i.e. the splitting of compound words into their parts, substantially improves retrieval performance. All findings are based on a thorough analysis using a large, reliable test collection.
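To make the two operations concrete, the following is a minimal sketch of what a simple stemmer and a lexicon-based decompounder for German might look like. It is not the method evaluated in the paper: the function names `simple_stem` and `decompound`, the suffix list, the linking elements, and the toy lexicon are all assumptions chosen for illustration only.

```python
# Illustrative sketch only; suffixes, linkers and lexicon are assumptions,
# not the resources or algorithms used in the paper.

def simple_stem(word, suffixes=("ern", "en", "er", "es", "em", "e", "n", "s"), min_stem=3):
    """Strip the first matching suffix (list is ordered longest first),
    provided at least `min_stem` characters remain."""
    w = word.lower()
    for suf in suffixes:
        if w.endswith(suf) and len(w) - len(suf) >= min_stem:
            return w[:-len(suf)]
    return w


def decompound(word, lexicon, min_part=3, linkers=("", "s", "es")):
    """Split `word` into lexicon entries, allowing an optional linking
    element between parts. Returns [word] if no full split is found."""
    def split(rest):
        # A known word is not split any further.
        if rest in lexicon:
            return [rest]
        for i in range(min_part, len(rest) - min_part + 1):
            head, tail = rest[:i], rest[i:]
            if head not in lexicon:
                continue
            for link in linkers:
                if not tail.startswith(link):
                    continue
                remainder = tail[len(link):]
                if len(remainder) < min_part:
                    continue
                parts = split(remainder)
                if parts:
                    return [head] + parts
        return None

    w = word.lower()
    return split(w) or [w]


# Toy lexicon (hypothetical); a real system would need a large word list.
lexicon = {"haus", "tür", "schlüssel", "arbeit", "markt"}

print(simple_stem("Hauses"))                    # haus
print(decompound("Haustürschlüssel", lexicon))  # ['haus', 'tür', 'schlüssel']
print(decompound("Arbeitsmarkt", lexicon))      # ['arbeit', 'markt']
```

In a retrieval setting, the parts returned by such a decompounder would typically be indexed alongside or instead of the full compound, so that a query for "Markt" can match a document containing "Arbeitsmarkt". A careful design also has to avoid spurious splits, which this sketch does not attempt to handle beyond the minimum part length.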
