RankPref: Ranking Sentences Describing Relations between Biomedical Entities with an Application

This paper presents a machine learning approach that selects and, more generally, ranks sentences that clearly describe relations between genes and terms related to them. Ranking is cast as a binary classification task: preference judgments over pairs of sentences are used to learn which sentence of a pair to choose. The learning process uses features that capture how the relationship is described textually and how central it is to the sentence. Before feature extraction, complex sentences are simplified into simple structures; we show that this simplification improves the results by up to 13%. In three different evaluations, the system significantly outperformed the baselines.
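To illustrate how pairwise preference judgments can be reduced to binary classification, the following is a minimal sketch, not the paper's implementation. It assumes each sentence is already represented by a numeric feature vector (the two features here, relation clarity and centrality, are hypothetical placeholders), and it uses a simple perceptron as a stand-in for whatever learner the authors actually used. All function names are illustrative.

```python
from typing import List, Tuple

Vector = List[float]


def dot(w: Vector, x: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def pairwise_examples(prefs: List[Tuple[Vector, Vector]]) -> List[Tuple[Vector, int]]:
    """Turn (preferred, other) pairs into binary examples on feature differences.

    Each pair yields a positive example (preferred - other) and its
    mirrored negative example (other - preferred).
    """
    examples = []
    for better, worse in prefs:
        diff = [b - w for b, w in zip(better, worse)]
        examples.append((diff, 1))
        examples.append(([-d for d in diff], -1))
    return examples


def train_perceptron(examples: List[Tuple[Vector, int]], epochs: int = 20) -> Vector:
    """Learn a weight vector with the classic perceptron update (assumed learner)."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
    return w


def rank(sentences: List[Tuple[str, Vector]], w: Vector) -> List[str]:
    """Order sentence ids by learned score, highest first."""
    ordered = sorted(sentences, key=lambda s: dot(w, s[1]), reverse=True)
    return [sid for sid, _ in ordered]


# Toy preference judgments: (features of preferred sentence, features of the other).
# Hypothetical features: [relation_clarity, centrality].
prefs = [
    ([1.0, 0.8], [0.2, 0.1]),
    ([0.9, 0.3], [0.1, 0.4]),
    ([0.7, 0.9], [0.6, 0.2]),
]
weights = train_perceptron(pairwise_examples(prefs))
ranking = rank([("A", [0.9, 0.9]), ("B", [0.1, 0.2]), ("C", [0.5, 0.5])], weights)
```

Because the classifier is linear in the feature differences, the same weight vector scores individual sentences, which is what lets a pairwise chooser double as a ranker.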
