Solving the word mismatch problem through automatic text analysis

May 1997

Jinxi Xu, B.S., Hunan University; M.S., Institute of Computing Technology, the Chinese Academy of Sciences; Ph.D., University of Massachusetts Amherst

Directed by: Professor W. Bruce Croft

Information Retrieval (IR) is concerned with locating documents that are relevant for a user's information need or query from a large collection of documents. A fundamental problem for information retrieval is word mismatch. A query is usually a short and incomplete description of the underlying information need. The users of IR systems and the authors of the documents often use different words to refer to the same concepts. This thesis addresses the word mismatch problem through automatic text analysis. We investigate two text analysis techniques, corpus analysis and local context analysis, and apply them in two domains of word mismatch, stemming and general query expansion. Experimental results show that these techniques can result in more effective retrieval.
