Measuring the relative importance of full text sections for information retrieval from scientific literature.

With the growing availability of full-text articles, integrating the abstracts and full texts of documents into a unified representation is essential for comprehensive search of the scientific literature. However, previous studies have shown that naïvely merging abstracts with the full texts of articles does not consistently yield better performance. Balancing the contribution of query terms that appear in the abstract and in full-text sections of differing importance remains a challenge for both traditional bag-of-words IR approaches and neural retrieval methods. In this work we establish the connection between the BM25 score of a query term appearing in a section of a full-text document and the probability of that document being clicked or identified as relevant. This probability is estimated using Pool Adjacent Violators (PAV), an isotonic regression algorithm that provides a maximum likelihood estimate based on the observed data. Using this probabilistic transformation of BM25 scores, we show improved performance on the PubMed Click dataset, developed and presented in this study, as well as on the 2007 TREC Genomics collection.
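As an illustration of the score-to-probability step, the following is a minimal sketch of PAV fitted to (score, click) pairs. It is not the implementation used in this study: the function names `pav_fit` and `pav_predict`, the toy scores, and the binary click labels are all assumptions made for the example. The sketch pools tied scores, then merges adjacent blocks whenever a lower-scoring block has a higher click rate than the block after it, which yields a non-decreasing step function from score to estimated probability.

```python
import bisect
from collections import defaultdict


def pav_fit(scores, labels):
    """Fit a non-decreasing step function from score to P(label = 1).

    Returns a list of (lower_score, probability) pairs sorted by score.
    Names and data layout are hypothetical, chosen for illustration only.
    """
    # Pool observations that share exactly the same score.
    totals = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        totals[s][0] += y
        totals[s][1] += 1

    # Each block holds [sum of labels, number of observations, lowest score].
    blocks = []
    for s in sorted(totals):
        blocks.append([totals[s][0], totals[s][1], s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, _ = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count

    return [(lower, total / count) for total, count, lower in blocks]


def pav_predict(model, score):
    """Look up the fitted probability for a score.

    Scores below the fitted range get the first block's value.
    """
    lowers = [lower for lower, _ in model]
    i = bisect.bisect_right(lowers, score) - 1
    return model[max(i, 0)][1]


# Toy data (invented): term scores in one section and whether the
# document was clicked.
toy_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
toy_clicks = [0, 1, 0, 0, 1, 1]
model = pav_fit(toy_scores, toy_clicks)
```

Fitting one such mapping per section would put scores from different sections on a common probability scale; how those per-section probabilities are then combined is not specified here and is left out of the sketch.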
