An effective approach to document retrieval via utilizing WordNet and recognizing phrases

Noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. A document is considered to contain a phrase if all content words of the phrase occur within a window of a certain size. The window size differs by phrase type and is determined using a decision tree. Phrases are treated as more important than individual terms, so documents retrieved for a query are ranked with matching phrases given higher priority. We utilize WordNet to disambiguate the senses of query terms. Whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible addition to the query. Experimental results show that our approach yields improvements of between 23% and 31% over the best-known results on the TREC 9, 10 and 12 collections for short (title only) queries, without using Web data.
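
The window test and the phrase-first ranking described above can be illustrated with a small Python sketch. It is not the authors' implementation. The function names `has_phrase` and `rank`, the use of simple match counts as scores, and the tie-breaking by document id are assumptions made for illustration. Window sizes are passed in as plain integers here, whereas the paper derives them per phrase type with a decision tree.

```python
from typing import Dict, List, Sequence, Tuple


def has_phrase(doc_tokens: Sequence[str], phrase_words: Sequence[str], window: int) -> bool:
    """Return True if all phrase words occur within some span of `window` consecutive tokens."""
    needed = set(phrase_words)
    if not needed:
        return False
    counts: Dict[str, int] = {}  # phrase words currently inside the sliding window
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left until the window spans at most `window` tokens.
        while right - left + 1 > window:
            old = doc_tokens[left]
            if old in needed:
                counts[old] -= 1
                if counts[old] == 0:
                    del counts[old]
            left += 1
        if len(counts) == len(needed):
            return True
    return False


def rank(
    docs: Dict[str, List[str]],
    query_terms: Sequence[str],
    phrases: Sequence[Tuple[Sequence[str], int]],
) -> List[str]:
    """Order documents by matched phrases first, then by matched individual terms.

    `phrases` holds (content words, window size) pairs. Using raw match counts
    as scores is an illustrative simplification, not the paper's weighting.
    """
    scored = []
    for doc_id, tokens in docs.items():
        phrase_score = sum(has_phrase(tokens, words, w) for words, w in phrases)
        term_score = len(set(query_terms) & set(tokens))
        scored.append((phrase_score, term_score, doc_id))
    # Phrase matches dominate; term matches only break ties between equal phrase scores.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]
```

In this sketch a document that contains the phrase but fewer query terms still outranks a document that contains more terms without the phrase, which is the priority ordering the abstract describes.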
