The impact on retrieval effectiveness of skewed frequency distributions

We present an analysis of word senses that provides fresh insight into the impact of word ambiguity on retrieval effectiveness, with potentially broader implications for other information retrieval processes. Using a methodology of forming artificially ambiguous words, known as pseudowords, and drawing on other researchers' work, the analysis illustrates that the frequency distribution of a word's senses plays a strong role in ambiguity's impact on effectiveness. Further investigation shows that this analysis may also apply to other retrieval processes, such as Cross Language Information Retrieval, query expansion, retrieval of OCR'ed texts, and stemming. The analysis appears to provide a means of explaining, at least in part, the reasons for these processes' impact (or lack of it) on effectiveness.
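
The pseudoword idea can be illustrated with a minimal sketch. It assumes a pseudoword is formed by replacing every occurrence of a few chosen words with one shared artificial term, so that each original word acts as a "sense" of that term. The function names, the "/" joining convention, and the toy token list are hypothetical and are not taken from the paper's experiments.

```python
from collections import Counter


def make_pseudoword(components):
    """Join the component words into one artificial ambiguous term (hypothetical naming scheme)."""
    return "/".join(components)


def apply_pseudoword(tokens, components):
    """Replace every occurrence of any component word with the shared pseudoword."""
    pseudo = make_pseudoword(components)
    members = set(components)
    return [pseudo if t in members else t for t in tokens]


def sense_distribution(tokens, components):
    """Relative frequency of each component word ('sense') among the pseudoword's occurrences."""
    members = set(components)
    counts = Counter(t for t in tokens if t in members)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {w: counts[w] / total for w in components}


# Toy collection (made up): one component word dominates the other two.
tokens = ["bank"] * 8 + ["shore"] + ["rate"] + ["the"] * 5
components = ["bank", "shore", "rate"]

ambiguous_tokens = apply_pseudoword(tokens, components)
distribution = sense_distribution(tokens, components)
dominant_share = max(distribution.values())
```

In this toy case `dominant_share` is the fraction of the pseudoword's occurrences that belong to its most frequent sense. A value close to 1 means a highly skewed distribution, the situation in which the abstract's argument suggests ambiguity would matter less for retrieval.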
