Named entity linking of geospatial and host metadata in GenBank for advancing biomedical research

Abstract: GenBank is a popular National Center for Biotechnology Information (NCBI) database for the submission and analysis of DNA sequences for biomedical research. The resource is part of the Entrez environment, which enables cross-linking with concepts and entries in other participating NCBI databases such as Taxonomy, PubMed and Protein. For example, a GenBank record of an influenza A hemagglutinin gene DNA sequence might have a link to the Taxonomy database for the organism, a link to the related article in PubMed (if published) and a link to the Protein entry for the hemagglutinin protein. Despite their importance in biomedical research areas such as population genetics, phylogeography and public health surveillance, the host and geospatial metadata of genetic sequences in GenBank are not linked to any database. Therefore, to facilitate biomedical research based on georeferenced DNA sequences and/or DNA sequences with normalized host names, we designed and developed a framework that enriches GenBank entries by linking their host metadata to the NCBI Taxonomy database and their geospatial metadata to GeoNames, a comprehensive knowledge base of geographic locations. Here, we introduce a database created by applying this framework to virus sequences in GenBank, and we evaluate our normalization algorithms on a set of manually annotated records pertaining to viruses. Although the framework is currently applied to viruses, it can be easily extended to other organisms, and we discuss the potential use of our resource for biomedical research.

Database URL: https://zodo.asu.edu/zoophydb/
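To make the idea of linking host and geospatial metadata concrete, the following is a minimal sketch, not the framework described above. It assumes the metadata appears in the `/host` and `/country` qualifiers of a GenBank source feature, and it uses two tiny hand-made lookup tables in place of the full NCBI Taxonomy and GeoNames resources. The function name `link_record` and both table names are hypothetical, chosen for illustration only.

```python
import re

# Tiny illustrative lookup tables (assumption: a real system would consult
# the full NCBI Taxonomy and GeoNames resources, not a hand-made dictionary).
HOST_TO_TAXID = {
    "homo sapiens": 9606,
    "human": 9606,
    "gallus gallus": 9031,
    "chicken": 9031,
    "sus scrofa": 9823,
    "swine": 9823,
}
PLACE_TO_GEONAMES_ID = {
    "egypt": 357994,
    "united states": 6252001,
}

# Matches qualifiers such as /host="Chicken" or /country="Egypt: Cairo".
QUALIFIER = re.compile(r'/(host|country)="([^"]*)"')


def link_record(source_feature_text):
    """Return the raw qualifier values and any identifiers they map to."""
    quals = dict(QUALIFIER.findall(source_feature_text))
    host = quals.get("host")
    country = quals.get("country")

    # Normalize the host string by trimming and lowercasing before lookup.
    host_key = host.strip().lower() if host else None

    # Keep only the country part of "Country: region" style values.
    place_key = country.split(":")[0].strip().lower() if country else None

    return {
        "host": host,
        "host_taxid": HOST_TO_TAXID.get(host_key),
        "country": country,
        "geonames_id": PLACE_TO_GEONAMES_ID.get(place_key),
    }


if __name__ == "__main__":
    record = '/organism="Influenza A virus"\n/host="Chicken"\n/country="Egypt: Cairo"'
    print(link_record(record))
```

An exact dictionary lookup like this returns `None` for misspelled, abbreviated or ambiguous values, which is where dedicated normalization algorithms of the kind evaluated in this work become necessary.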
