Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments

We propose and evaluate a query expansion mechanism that supports searching and browsing in collections of annotated documents. Based on generative language models, our feedback mechanism uses document-level annotations to bias the generation of expansion terms and to generate browsing suggestions in the form of concepts selected from a controlled vocabulary (as typically used in digital library settings). We provide a detailed formalization of our feedback mechanism and evaluate its effectiveness using the TREC 2006 Genomics track test set. In terms of retrieval effectiveness, we find a 20% improvement in mean average precision over a query-likelihood baseline, while also increasing precision at 10. When we base the parameter estimation and feedback generation of our algorithm on a large corpus, we also find an improvement over state-of-the-art relevance models. The browsing suggestions are assessed along two dimensions: relevance and specificity. We present an account of per-topic results, which helps in understanding the types of queries for which our feedback mechanism is particularly helpful.
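For readers unfamiliar with the two effectiveness measures named above, the following minimal sketch shows how mean average precision and precision at 10 are conventionally computed from a ranked list and a set of relevance judgments. It is an illustration only, not the evaluation code used in this work: the function names, the toy document identifiers, and the choice to treat the 20% figure as a relative change are assumptions made for the example.

```python
from typing import Dict, Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant document occurs,
    normalized by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average of per-topic average precision over all topics in `runs`."""
    if not runs:
        return 0.0
    scores = [average_precision(runs[t], qrels.get(t, set())) for t in runs]
    return sum(scores) / len(scores)


def relative_improvement(baseline: float, system: float) -> float:
    """Relative change of `system` over `baseline` (assumed reading of "20% improvement")."""
    return (system - baseline) / baseline


# Hypothetical toy data: two topics with made-up document identifiers.
toy_runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d5", "d6"]}
toy_qrels = {"t1": {"d1", "d3"}, "t2": {"d6"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```

Per-topic values of `average_precision` are what a per-topic analysis would compare between a baseline run and a feedback run.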
