Corpus-based stemming using cooccurrence of word variants

Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique that uses corpus-based cooccurrence statistics of word variants to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages over conventional approaches that employ only morphological rules.
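
The general idea of refining a stemmer with cooccurrence statistics can be illustrated with a minimal sketch. The abstract does not state which statistic or clustering step is used, so everything below is an assumption made for illustration: the function name refine_stem_classes, the document-level cooccurrence window, the score n_ab / (n_a + n_b), the threshold, and the toy prefix "stemmer" are all hypothetical. Word variants that an initial stemmer would conflate stay together only if they cooccur often enough in the corpus.

```python
from collections import defaultdict
from itertools import combinations


def refine_stem_classes(docs, stem, threshold=0.1):
    """Split the classes produced by `stem` using document-level cooccurrence.

    Hypothetical sketch: the score and threshold are illustrative assumptions,
    not the statistic used in the paper.
    """
    doc_freq = defaultdict(int)   # word -> number of documents containing it
    pair_freq = defaultdict(int)  # (a, b) -> number of documents containing both

    for tokens in docs:
        vocab = sorted(set(tokens))
        for word in vocab:
            doc_freq[word] += 1
        # Only count pairs that the initial stemmer would conflate.
        for a, b in combinations(vocab, 2):
            if stem(a) == stem(b):
                pair_freq[(a, b)] += 1

    # Union-find over words; strongly associated variants are merged.
    parent = {word: word for word in doc_freq}

    def find(word):
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    for (a, b), n_ab in pair_freq.items():
        score = n_ab / (doc_freq[a] + doc_freq[b])  # assumed association score
        if score >= threshold:
            parent[find(a)] = find(b)

    classes = defaultdict(set)
    for word in doc_freq:
        classes[find(word)].add(word)
    return sorted(sorted(members) for members in classes.values())


# Toy corpus: a 4-character prefix stemmer would conflate "police" with "policy".
docs = [
    ["policy", "policies", "police"],
    ["policy", "policies"],
    ["police", "officer"],
    ["police", "officers", "officer"],
]
classes = refine_stem_classes(docs, stem=lambda w: w[:4], threshold=0.3)
print(classes)
```

In this toy corpus, "policy" and "policies" cooccur in two documents and remain in one class, while "police" cooccurs with them only once and is split off, which is the kind of corpus-specific correction the abstract motivates.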
