Probabilistic passage models for semantic search of genomics literature

We explore unsupervised learning techniques for extracting semantic information about biomedical concepts and topics, and introduce a passage retrieval model that uses these semantics in context to improve genomics literature search. Our contributions include a new passage retrieval model based on a Markov random field (an undirected graphical model), and new methods for modeling passage-concepts, document-topics, and passage-terms as potential functions within the model. Each potential function includes distributional evidence to disambiguate topics, concepts, and terms in context. The joint distribution across potential functions in the graph represents the probability that a passage is relevant to a biologist's information need. Relevance ranking within each potential function simplifies normalization across potential functions and eliminates the need to tune the parameters of the passage retrieval model. Our dimensional indexing model facilitates efficient aggregation of topic, concept, and term distributions. The proposed passage retrieval model improves search results in the presence of varying levels of semantic evidence, outperforming models of query terms, concepts, or document topics alone. Our results exceed the state of the art for automatic document retrieval by 14.46% (0.3554 vs. 0.3105) and for passage retrieval by 15.57% (0.1128 vs. 0.0976) as assessed by the TREC 2007 Genomics Track, and for automatic document retrieval by 18.56% (0.3424 vs. 0.2888) as assessed by the TREC 2005 Genomics Track. The automatic document retrieval results for TREC 2007 and TREC 2005 are statistically significant at the 95% confidence level (p = 0.0359 and p = 0.0253, respectively). The passage retrieval result is significant at the 90% confidence level (p = 0.0893). © 2008 Wiley Periodicals, Inc.
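The idea of ranking within each potential function and then combining the potentials into a joint score can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the functional form of the potentials, so the rank-to-potential mapping `(n - rank) / n`, the product combination, the function names, and the toy scores below are all assumptions chosen for illustration only.

```python
from math import prod


def rank_potential(raw_scores):
    """Map raw scores to rank-based potentials in (0, 1].

    Assumption: the best-ranked passage gets 1.0 and the worst gets 1/n.
    Using ranks puts every evidence source on the same scale, so no
    per-source weight has to be tuned.
    """
    n = len(raw_scores)
    ordered = sorted(raw_scores, key=raw_scores.get, reverse=True)
    return {pid: (n - rank) / n for rank, pid in enumerate(ordered)}


def joint_scores(evidence):
    """Unnormalized joint score: product of the rank-based potentials."""
    potentials = [rank_potential(scores) for scores in evidence.values()]
    # Only score passages that every evidence source has ranked.
    passages = set.intersection(*(set(p) for p in potentials))
    return {pid: prod(p[pid] for p in potentials) for pid in passages}


# Hypothetical raw scores on three incomparable scales.
evidence = {
    "terms": {"p1": 12.0, "p2": 7.5, "p3": 3.1},
    "concepts": {"p1": 0.20, "p2": 0.90, "p3": 0.10},
    "topics": {"p1": 0.55, "p2": 0.60, "p3": 0.05},
}

scores = joint_scores(evidence)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['p2', 'p1', 'p3']
```

In this toy case passage `p1` has the strongest term match, but `p2` ranks first on both concept and topic evidence and therefore wins the combined ranking, which mirrors the claim that combining the three kinds of evidence can outperform query terms alone.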
