Using Linguistic Knowledge in Information Retrieval (Technical Report)

Current practice in Information Retrieval (IR) is largely based on statistical techniques. These techniques are reasonably successful, but many researchers believe that they have reached their upper bound. Some recent IR research investigates whether Natural Language Processing techniques can improve the performance of existing retrieval strategies. In the UPLIFT project (Utrecht Project: Linguistic Information for Free Text retrieval) we investigate whether adding linguistic information improves the performance of a statistical retrieval engine for the Dutch language. During the first phase of the project, which is now completed, we concentrated on morphological information and on semantic information in the form of synonymy relations. Morphological information can be used during document indexing: using stems instead of word forms as the basis for indexing reduces the variation of index terms. Many algorithms have been developed to reduce word forms to their ‘stem’, ranging from simple non-linguistic truncation algorithms to dictionary-based linguistic algorithms. Previous research on stemming has shown both positive and negative effects on retrieval performance. In this report we describe experiments in which several linguistic and non-linguistic stemmers were evaluated on a Dutch test collection. The results show that linguistic stemming can yield a significant improvement in Recall over non-linguistic stemming without causing a significant deterioration in Precision. Besides testing morphological algorithms, we also experimented with a synonym database, which was used to expand query terms with synonymous expressions. These experiments show that synonym expansion is potentially useful, but that disambiguation of query terms is essential.
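
The two ideas described above, indexing on stems rather than word forms and expanding query terms with synonyms, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the toy Dutch documents, the one-entry synonym table, and the fixed-length truncation stemmer, which merely stands in for the simplest non-linguistic kind of stemmer and is not one of the stemmers evaluated in the report.

```python
from collections import defaultdict

def truncate_stem(word, length=5):
    """Toy non-linguistic stemmer (assumption): keep the first `length` characters."""
    return word.lower()[:length]

def build_index(docs, stem):
    """Map each stemmed token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index

def expand_query(terms, synonyms):
    """Append every listed synonym of each query term, without disambiguation."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(synonyms.get(term, []))
    return expanded

def search(index, terms, stem):
    """Return ids of documents that match at least one stemmed query term."""
    hits = set()
    for term in terms:
        hits |= index.get(stem(term), set())
    return hits

# Hypothetical toy collection and synonym table, not data from the report.
DOCS = {
    1: "de fietsen staan buiten",
    2: "een fiets kopen",
    3: "rijwiel te koop",
}
SYNONYMS = {"fiets": ["rijwiel"]}

if __name__ == "__main__":
    raw_index = build_index(DOCS, str.lower)       # index on word forms
    stem_index = build_index(DOCS, truncate_stem)  # index on stems
    query = ["fiets"]
    print(search(raw_index, query, str.lower))
    print(search(stem_index, query, truncate_stem))
    print(search(stem_index, expand_query(query, SYNONYMS), truncate_stem))
```

In this toy setting the stem index also matches the plural form "fietsen", and expansion additionally retrieves the document that only mentions "rijwiel". Because `expand_query` adds every listed synonym regardless of word sense, an ambiguous query term would pull in unrelated documents, which is the reason disambiguation matters.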
