Document Expansion Based on WordNet for Robust IR

The use of semantic information to improve IR is a long-standing goal. This paper presents a novel document expansion method that uses a WordNet-based system to find related concepts and words. The expansion words are indexed separately and, when combined with the regular index, improve results on three datasets over a state-of-the-art IR engine. Because many IR systems are not robust, in the sense that they need careful fine-tuning and optimization of their parameters, we explored several parameter settings. The results show that our method is especially effective in realistic, non-optimal settings, adding robustness to the IR engine. We also explored the effect of document length and show that our method is especially successful with shorter documents.
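
The idea of keeping expansion words in a separate index and combining it with the regular index at query time can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the expansion words are hand-picked rather than produced by a WordNet-based system, the scoring is a toy term-frequency sum rather than the ranking function of the IR engine used, and the `weight` parameter and the linear combination are hypothetical.

```python
from collections import Counter

def build_index(docs):
    """Map each document id to a Counter of its lowercased terms."""
    return {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

def score(query, index, doc_id):
    """Toy score (assumption): sum of query-term frequencies in the document."""
    counts = index.get(doc_id, Counter())
    return sum(counts[term] for term in query.lower().split())

def combined_score(query, doc_id, regular, expansion, weight=0.5):
    """Linear combination of the two indexes; `weight` is a hypothetical parameter."""
    return score(query, regular, doc_id) + weight * score(query, expansion, doc_id)

# Original document text, indexed as usual.
docs = {
    "d1": "the bank raised interest rates",
    "d2": "the river bank flooded",
}
# Hand-picked expansion words standing in for the output of a
# WordNet-based relatedness system (illustrative only).
expansions = {
    "d1": "finance money loan",
    "d2": "water stream shore",
}

regular_index = build_index(docs)
expansion_index = build_index(expansions)

def rank(query, weight=0.5):
    """Rank documents by combined score, breaking ties by document id."""
    return sorted(
        docs,
        key=lambda d: (-combined_score(query, d, regular_index, expansion_index, weight), d),
    )

if __name__ == "__main__":
    print(rank("bank water"))            # expansion index contributes to the score
    print(rank("bank water", weight=0))  # regular index only
```

In this toy setup, the query term "water" appears in neither document's original text, so it can only affect the ranking through the separate expansion index; setting `weight` to zero falls back to the regular index alone.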
