Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

Abstract: The bioCADDIE dataset retrieval challenge brought together different approaches to retrieving biomedical datasets relevant to a user's query, expressed as a text description of the needed dataset. As part of this challenge, we carried out a series of experiments to evaluate both probabilistic and machine learning-driven information retrieval techniques for biomedical dataset retrieval. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than the other methods. We also show that, although a rich space of potential representations and features is available in this domain, machine learning-based re-ranking models do not improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine that provides access to biomedical datasets. Retrieval performance is expected to improve further with additional training data, either created by expert annotation or gathered from usage logs, clicks and other signals during natural operation of the system.

Database URL: https://github.com/emory-irlab/biocaddie
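The idea of boosting important keywords in a verbose query can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the IDF-based scoring function, and the hand-set boost values in `BOOST` are all assumptions made for illustration, and the abstract does not say how the paper selects or optimizes the weights automatically.

```python
import math
from collections import Counter

# Toy corpus of tokenized dataset descriptions (hypothetical data).
DOCS = [
    "gene expression data mouse liver".split(),
    "protein structure data human".split(),
    "search for data on human studies".split(),
]

def idf(term, docs):
    """Smoothed inverse document frequency of a term in the corpus."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 0.5))

def score(query_terms, doc, docs, weights=None):
    """Sum of weight * IDF * saturated term frequency over matching query terms."""
    weights = weights or {}
    tf = Counter(doc)
    total = 0.0
    for term in set(query_terms):
        if tf[term] == 0:
            continue  # term absent from this document
        saturated_tf = tf[term] / (tf[term] + 1.0)
        total += weights.get(term, 1.0) * idf(term, docs) * saturated_tf
    return total

def rank(query_terms, docs, weights=None):
    """Return document indices ordered from highest to lowest score."""
    scores = [score(query_terms, doc, docs, weights) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# A verbose query: only "gene" and "expression" carry the information need.
QUERY = "search for data on gene expression".split()

# Assumed boosts, set by hand here; every other term keeps weight 1.0.
BOOST = {"gene": 3.0, "expression": 3.0}

if __name__ == "__main__":
    print(rank(QUERY, DOCS))         # [2, 0, 1]
    print(rank(QUERY, DOCS, BOOST))  # [0, 2, 1]
```

With uniform weights, the document that merely repeats the query's filler words ("search", "for", "on") ranks first. Once the two key terms are boosted, the gene expression dataset moves to the top.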
