Expanding Queries Using Multiple Resources (The AID Group at TREC 2006: Genomics Track)

We describe our participation in the TREC 2006 Genomics track, in which our main focus was on query expansion. We hypothesized that applying query expansion techniques would help us both to identify and retrieve synonymous terms, and to cope with ambiguity. To this end, we developed several collection-specific as well as web-based strategies. We also performed post-submission experiments, in which we compared various retrieval engines, such as Lucene, Indri, and Lemur, using a simple baseline topic set. When indexing entire paragraphs as pseudo-documents, we find that Lemur achieves the highest document-, passage-, and aspect-level scores, using the KL-divergence method and its default settings. Additionally, we index the collection at a finer level of granularity, by creating pseudo-documents consisting of individual sentences. When we search these instead of paragraphs in Lucene, the passage-level scores improve considerably. Finally, we note that stemming improves overall scores by at least 10%.
