Passage extraction and result combination for genomics information retrieval

In this paper, we first propose passage extraction algorithms for building indices, with the aim of returning more accurate passages as query answers. Second, we propose a basic and an improved result combination method that combine the results retrieved from different indices in order to select and merge relevant passages as output. For passage extraction, we propose three new algorithms: paragraphParsed, sentenceParsed and wordSentenceParsed. For result combination, we propose a novel method that uses factor analysis to generate a better baseline result for combination by finding hidden common factors that can be used to estimate the importance of keywords and keyword associations. Finally, we report experimental results that confirm the effectiveness and superiority of the factor-analysis-based method for result combination. Our proposed approaches achieve strong results on the TREC 2006 and 2007 Genomics data sets and offer a promising avenue for constructing high-performance information retrieval systems in biomedicine.
