Learning to Rank Figures within a Biomedical Article

Hundreds of millions of figures are available in the biomedical literature, where they represent important experimental evidence. This ever-increasing volume makes it difficult for scientists to find the figures they need effectively and accurately, a step that is crucial for validating research facts and for formulating or testing novel research hypotheses. Current figure search applications cannot fully meet this challenge because their “bag of figures” assumption ignores the relationships among figures. In our previous study, hundreds of biomedical researchers annotated articles for which they serve as corresponding authors: each ranked the figures in their paper by importance, at their own discretion, a task we refer to as “figure ranking”. Using this annotated collection, we investigated computational approaches to ranking figures automatically. We exploited and extended state-of-the-art listwise learning-to-rank algorithms and developed a new supervised-learning model, BioFigRank. Cross-validation results show that BioFigRank performed best among the state-of-the-art computational models compared, and that greedy feature selection further boosts ranking performance significantly. We also evaluated BioFigRank against three levels of human domain experts: (1) First Author; (2) Non-Author-In-Domain-Expert, who is neither an author nor a co-author of an article but works in the same field as its corresponding author; and (3) Non-Author-Out-Domain-Expert, who is neither an author nor a co-author of an article and who may or may not work in the same field as its corresponding author. Our results show that BioFigRank outperforms Non-Author-Out-Domain-Expert and performs as well as Non-Author-In-Domain-Expert. Although BioFigRank underperforms First Author, most biomedical researchers are either in-domain or out-domain experts with respect to a given article; we therefore conclude that BioFigRank is an artificial intelligence system that offers expert-level intelligence to help biomedical researchers navigate rapidly proliferating big data efficiently.
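
To make the listwise idea concrete, the sketch below shows a generic ListNet-style top-one cross-entropy loss computed over all figures of a single article. It is an illustration only, not the BioFigRank model: the function names, the toy importance scores, and the choice of a softmax over raw scores are assumptions introduced here, and the abstract does not specify the loss or features that BioFigRank actually uses.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into top-one probabilities."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(true_scores, predicted_scores):
    """Cross entropy between the top-one distributions induced by the
    author-provided importance scores and by the model's scores.

    Both lists describe the figures of ONE article, in the same order.
    (Hypothetical helper; not taken from the paper.)
    """
    p_true = softmax(true_scores)
    p_pred = softmax(predicted_scores)
    return -sum(pt * math.log(pp) for pt, pp in zip(p_true, p_pred))


def rank_figures(predicted_scores):
    """Return figure indices ordered from most to least important."""
    return sorted(range(len(predicted_scores)),
                  key=lambda i: predicted_scores[i],
                  reverse=True)


# Toy article with three figures; a higher score means more important.
author_scores = [3.0, 2.0, 1.0]   # assumed encoding of an author's ranking
good_model = [2.5, 1.0, 0.2]      # agrees with the author's order
bad_model = [0.2, 1.0, 2.5]       # reverses the author's order

loss_good = listwise_loss(author_scores, good_model)
loss_bad = listwise_loss(author_scores, bad_model)
predicted_order = rank_figures(good_model)
```

Because the loss is defined over the whole list of figures in an article rather than over individual figures or pairs, a scoring function trained to minimize it is penalized for misordering figures relative to one another, which is the relationship that the “bag of figures” assumption ignores.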

[1]  Tie-Yan Liu,et al.  Learning to rank: from pairwise approach to listwise approach , 2007, ICML '07.

[2]  Thorsten Joachims,et al.  Training linear SVMs in linear time , 2006, KDD '06.

[3]  Hong Yu,et al.  Towards Answering Biological Questions with Experimental Evidence: Automatically Identifying Text that Summarize Image Content in Full-Text Articles , 2006, AMIA.

[4]  Hong Yu,et al.  Automatic Figure Ranking and User Interfacing for Intelligent Figure Search , 2010, PloS one.

[5]  Eugene Charniak,et al.  Coarse-to-Fine n-Best Parsing and MaxEnt Discriminative Reranking , 2005, ACL.

[6]  Tie-Yan Liu,et al.  Listwise approach to learning to rank: theory and algorithm , 2008, ICML '08.

[7]  Frank K. Soong,et al.  A Comparative Study of Discriminative Methods for Reranking LVCSR N-Best Hypotheses in Domain Adaptation and Generalization , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[8]  Tin Wee Tan,et al.  Towards big data science in the decade ahead from ten years of InCoB and the 1st ISCB-Asia Joint Conference , 2011, BMC Bioinformatics.

[9]  Michael Collins,et al.  Ranking Algorithms for Named Entity Extraction: Boosting and the VotedPerceptron , 2002, ACL.

[10]  Jie Yao,et al.  Searching online journals for fluorescence microscope images depicting protein subcellular location patterns , 2001, Proceedings 2nd Annual IEEE International Symposium on Bioinformatics and Bioengineering (BIBE 2001).

[11]  Edward M. Marcotte,et al.  Exploiting Big Biology: Integrating Large-scale Biological Data for Function Inference , 2001, Briefings Bioinform..

[12]  Finding correlations in big data , 2012, Nature Biotechnology.

[13]  Gregory N. Hullender,et al.  Learning to rank using gradient descent , 2005, ICML.

[14]  Tony Cass,et al.  A Handler for Big Data , 1998, Science.

[15]  Preslav Nakov,et al.  BioText Search Engine: beyond abstract search , 2007, Bioinform..

[16]  Xiaoyan Zhu,et al.  GeneTUKit: a software for document-level gene normalization , 2011, Bioinform..

[17]  Hong Yu,et al.  FigSum: Automatically Generating Structured Text Summaries for Figures in Biomedical Literature , 2009, AMIA.

[18]  Fei Liu,et al.  A Supervised Framework for Keyword Extraction From Meeting Transcripts , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[19]  Harris Wu,et al.  Probabilistic question answering on the web , 2002, WWW '02.

[20]  M. Snir,et al.  Big data, but are we ready? , 2011, Nature Reviews Genetics.

[21]  G. Prasad LEARNING TO LINK ENTITIES WITH KNOWLEDGE BASE , 2016 .

[22]  Carol Peters,et al.  Cross-Language Evaluation Forum: Objectives, Results, Achievements , 2004, Information Retrieval.

[23]  Yoram Singer,et al.  An Efficient Boosting Algorithm for Combining Preferences by , 2013 .

[24]  Andrew McCallum,et al.  Piecewise pseudolikelihood for efficient training of conditional random fields , 2007, ICML '07.

[25]  Sanjeev Khudanpur,et al.  Forest Reranking for Machine Translation with the Perceptron Algorithm , 2009 .

[26]  Michael Krauthammer,et al.  Yale Image Finder (YIF): a new search engine for retrieving biomedical images , 2008, Bioinform..

[27]  Cheng Thao,et al.  GoldMiner: a radiology image search engine. , 2007, AJR. American journal of roentgenology.

[28]  Shih-Fu Chang,et al.  Exploring Text and Image Features to Classify Images in Bioscience Literature , 2006, BioNLP@NAACL-HLT.

[29]  Hong Yu,et al.  Are figure legends sufficient? Evaluating the contribution of associated text to biomedical figure comprehension , 2009, Journal of biomedical discovery and collaboration.

[30]  John M. Conroy,et al.  Beyond Captions: Linking Figures with Abstract Sentences in Biomedical Articles , 2012, PloS one.

[31]  Hang Li,et al.  AdaRank: a boosting algorithm for information retrieval , 2007, SIGIR.

[32]  Tao Qin,et al.  Query-level loss functions for information retrieval , 2008, Inf. Process. Manag..

[33]  Ramesh Nallapati,et al.  Discriminative models for information retrieval , 2004, SIGIR '04.

[34]  Michael Collins,et al.  Discriminative Reranking for Natural Language Parsing , 2000, CL.

[35]  W. Bruce Croft,et al.  Improving the effectiveness of information retrieval with local context analysis , 2000, TOIS.

[36]  José Esparza,et al.  The discovery value of “Big Science” , 2007, The Journal of experimental medicine.

[37]  Filip Radlinski,et al.  A support vector method for optimizing average precision , 2007, SIGIR.

[38]  J. Mervis U.S. science policy. Agencies rally to tackle big data. , 2012, Science.

[39]  Aravind K. Joshi,et al.  Ranking and Reranking with Perceptron , 2005, Machine Learning.

[40]  Elizabeth Pennisi How Will Big Pictures Emerge From a Sea of Biological Data? , 2005, Science.

[41]  Hong Yu,et al.  Accessing bioscience images from abstract sentences , 2006, ISMB.

[42]  Hagit Shatkay,et al.  Integrating image data into biomedical text categorization , 2006, ISMB.

[43]  Feifan Liu,et al.  Unsupervised language model adaptation via topic modeling based on named entity hypotheses , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[44]  George R. Thoma,et al.  Annotation and retrieval of clinically relevant images , 2009, Int. J. Medical Informatics.

[45]  Liang Huang,et al.  Forest Reranking: Discriminative Parsing with Non-Local Features , 2008, ACL.