Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts

Abstract The recent movement towards open data in the biomedical domain has generated a large number of publicly accessible datasets. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal that aims to facilitate their reuse and thereby accelerate scientific advances. However, as the number of stored and indexed biomedical datasets grows, it becomes increasingly challenging to retrieve the datasets relevant to researchers’ queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured text of each dataset, including its title and description, and combines a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms the other systems in terms of inferred Average Precision and inferred normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata
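As an illustration of the query expansion step mentioned in the abstract, the following is a minimal sketch of embedding-based expansion: each query term is augmented with its nearest neighbours by cosine similarity. It is not the system's actual implementation. The function names `cosine` and `expand_query`, the parameter `k` and the three-dimensional `TOY_EMBEDDINGS` vectors are all hypothetical; a real system would load embeddings trained on a large biomedical corpus.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_query(terms, embeddings, k=2):
    """Append the k nearest vocabulary words of each query term.

    Terms without an embedding are kept but not expanded.
    """
    expanded = list(terms)
    for term in terms:
        if term not in embeddings:
            continue
        # Rank every other vocabulary word by similarity to this term.
        scored = sorted(
            (
                (cosine(embeddings[term], vec), word)
                for word, vec in embeddings.items()
                if word != term
            ),
            reverse=True,
        )
        for _, word in scored[:k]:
            if word not in expanded:
                expanded.append(word)
    return expanded


# Hypothetical toy vectors, for illustration only.
TOY_EMBEDDINGS = {
    "cancer": [0.9, 0.1, 0.0],
    "tumor": [0.85, 0.15, 0.05],
    "carcinoma": [0.8, 0.2, 0.0],
    "gene": [0.1, 0.9, 0.1],
    "protein": [0.15, 0.8, 0.3],
    "mouse": [0.0, 0.1, 0.95],
}

if __name__ == "__main__":
    print(expand_query(["cancer"], TOY_EMBEDDINGS, k=2))
```

With these toy vectors, the query `["cancer"]` expands to `["cancer", "tumor", "carcinoma"]`. The expanded terms would then be passed to the underlying IR model, typically with lower weights than the original terms so that the expansion does not dominate the ranking.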
