IIT TREC 2007 Genomics Track: Using Concept-Based Semantics in Context for Genomics Literature Passage Retrieval

For the TREC 2007 Genomics Track [1], we explore unsupervised techniques for extracting semantic information about biomedical concepts, together with a retrieval model that uses these semantics in context to improve passage retrieval precision. We also evaluate dependency grammar analysis for boosting the rank of passages in which complementary subject/object concept pairs can be identified between queries and sentences from candidate passages. In our model, a concept is represented as a set of synonymous terms and a concept-word distribution. Concept terms are identified using an information extraction technique that relies on shallow sentence parsing, external knowledge sources, and document context. The system combines a dimensional data model, which indexes scientific literature at multiple levels of document context, with a rule-based query processing algorithm. The data model consists of two hierarchical indices: one for individual words and a second for extracted concepts. The word index supports retrieval of single-word or multi-word terms; the concept index supports efficient retrieval of single or multiple independent concepts. A retrieval function combines concepts with term statistics at multiple levels of context to identify relevant passages. Finally, using dependency grammar analysis, we boost the relevance score of passage sentences whose term dependencies complement subject/object pairs in the query. Our objective for this year's forum was to improve passage retrieval precision. We submitted three automatically generated results to the TREC forum, one for each of three variations of our retrieval model. All three exceeded the track median for character-based passage retrieval by 75 to 93%. The mean average precision (MAP) of our top passage retrieval model was 0.0940, which compares favorably to the top result of 0.0976.
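As a rough illustration of the ideas summarized above, the following Python sketch represents a concept as a set of synonymous terms plus a concept-word distribution, and boosts a sentence score when a subject/object concept pair from the query also appears in the sentence. It is not the system described here: the names `Concept`, `concept_matches`, and `score_sentence`, the count-based base score, the exact-pair match, and the boost factor of 1.5 are all assumptions made for the example, and the subject/object pairs are supplied by hand rather than produced by a dependency parser.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical concept record: synonymous terms plus a word distribution."""
    name: str
    synonyms: set
    # Concept-word distribution (word -> probability); unused in this sketch.
    word_dist: dict = field(default_factory=dict)


def concept_matches(concept, tokens):
    """Return True if any synonym occurs in tokens as a contiguous phrase."""
    text = " " + " ".join(tokens) + " "
    return any(" " + synonym + " " in text for synonym in concept.synonyms)


def score_sentence(query_concepts, query_pair, tokens, sentence_pairs, boost=1.5):
    """Count matched query concepts, then boost if the query's
    (subject, object) pair is among the sentence's pairs.

    sentence_pairs stands in for the output of a dependency parser;
    the boost value is an arbitrary illustrative choice.
    """
    base = float(sum(1 for c in query_concepts if concept_matches(c, tokens)))
    if base > 0 and query_pair in sentence_pairs:
        return base * boost
    return base


# Toy data, invented for the example.
brca1 = Concept("BRCA1", {"brca1", "breast cancer 1"})
cancer = Concept("breast cancer", {"breast cancer", "breast carcinoma"})
tokens = "mutations in brca1 increase the risk of breast carcinoma".split()
query_pair = ("BRCA1", "breast cancer")

boosted = score_sentence([brca1, cancer], query_pair, tokens, {query_pair})
unboosted = score_sentence([brca1, cancer], query_pair, tokens, set())
```

In this toy setup the sentence mentions both query concepts through their synonyms, so `boosted` exceeds `unboosted` only because the subject/object pair is also present. A fuller version would replace the count-based base score with term statistics gathered at several levels of document context.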

[1] James P. Callan, et al. Passage-level evidence in document retrieval, 1994, SIGIR '94.

[2] Ralph Kimball, et al. The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses, 1996.

[3] Martin F. Porter, et al. An algorithm for suffix stripping, 1997, Program.

[4] Marti A. Hearst, et al. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text, 2002, Pacific Symposium on Biocomputing.

[5] Hugo Zaragoza, et al. Information Retrieval: Algorithms and Heuristics, 2002, Information Retrieval.

[6] Marcel Worring, et al. NIST Special Publication, 2005.

[7] Ophir Frieder, et al. Integrating structured data and text: a relational approach, 1997.

[8] Stephen E. Robertson, et al. Okapi/Keenbow at TREC-8, 1999, TREC.

[9] Ophir Frieder, et al. IIT TREC 2006: Genomics Track, 2006, TREC.

[10] Nazli Goharian, et al. A Relational Genomics Search Engine, 2006, BIOCOMP.

[11] Hamid Pirahesh, et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, 1996, Data Mining and Knowledge Discovery.

[12] Adwait Ratnaparkhi, et al. IBM's Statistical Question Answering System, 2000, TREC.

[13] Salim Roukos, et al. IBM's Statistical Question Answering System-TREC 11, 2001, TREC.

[14] Justin Zobel, et al. Passage retrieval revisited, 1997, SIGIR '97.

[15] Justin Zobel, et al. Effective ranking with arbitrary passages, 2001.

[16] Ellen M. Voorhees, et al. Proceedings of the Fourteenth Text REtrieval Conference, TREC 2005, Gaithersburg, Maryland, USA, November 15-18, 2005, 2005, NIST Special Publication.

[17] Christopher D. Manning, et al. Generating Typed Dependency Parses from Phrase Structure Parses, 2006, LREC.

[18] Marti A. Hearst, et al. TREC 2007 Genomics Track Overview, 2007, TREC.