Passage relevance models for genomics search

We present a passage relevance model that integrates semantic and statistical evidence of biomedical concepts and topics in context within the framework of a probabilistic graphical model. Component models of topics, concepts, terms, and documents are represented as potential functions within a Markov Random Field, and the probability that a passage is relevant to a biologist's information need is represented as the joint distribution across all potential functions. Relevance model feedback from top-ranked passages is used to improve distributional estimates of concepts and topics in context, and a dimensional indexing strategy supports efficient aggregation of concept and term statistics. By integrating multiple sources of evidence, including dependencies between topics, concepts, and terms, we seek to improve the precision of genomics literature passage retrieval. Using this model, we demonstrate statistically significant improvements in retrieval precision on a large genomics literature corpus.
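
The idea of scoring a passage by a joint distribution over potential functions can be illustrated with a minimal sketch. It is not the model described above: the two potentials, the smoothing, the weights, the toy lexicon, and the concept identifier `C_BRCA1` are all assumptions made for illustration. The sketch relies only on the general fact that a Markov Random Field's joint distribution is proportional to a product of positive potentials, so passages can be ranked by a weighted sum of log-potentials without computing the normalizing constant.

```python
import math

def term_potential(query_terms, tokens):
    """Smoothed fraction of passage tokens matching a query term (always > 0)."""
    matches = sum(1 for tok in tokens if tok in query_terms)
    return (matches + 1.0) / (len(tokens) + 2.0)

def concept_potential(query_concepts, tokens, lexicon):
    """Smoothed fraction of passage tokens that map to a query concept."""
    matches = sum(1 for tok in tokens if lexicon.get(tok) in query_concepts)
    return (matches + 1.0) / (len(tokens) + 2.0)

def log_score(passage, query_terms, query_concepts, lexicon, weights):
    """Weighted sum of log-potentials: the log of the unnormalized joint."""
    tokens = passage.lower().split()
    return (
        weights["term"] * math.log(term_potential(query_terms, tokens))
        + weights["concept"]
        * math.log(concept_potential(query_concepts, tokens, lexicon))
    )

def rank(passages, query_terms, query_concepts, lexicon, weights):
    """Return passages ordered from highest to lowest log-score."""
    return sorted(
        passages,
        key=lambda p: log_score(p, query_terms, query_concepts, lexicon, weights),
        reverse=True,
    )

# Hypothetical toy data: two surface forms mapped to one made-up concept ID.
LEXICON = {"brca1": "C_BRCA1", "brcc1": "C_BRCA1"}
WEIGHTS = {"term": 1.0, "concept": 1.0}
PASSAGES = [
    "the weather was mild this spring",
    "brcc1 mutation raises breast cancer risk",
    "brca1 mutation carriers",
]

if __name__ == "__main__":
    for p in rank(PASSAGES, {"brca1", "mutation"}, {"C_BRCA1"}, LEXICON, WEIGHTS):
        print(p)
```

In this toy setup the concept potential lets a passage that uses a synonym contribute evidence even when the literal query term is absent, which is the kind of complementary evidence the abstract refers to. Dependencies between topics, concepts, and terms, as well as relevance feedback, are left out of the sketch.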
