Natural Language Processing Methods for Enhancing Geographic Metadata for Phylogeography of Zoonotic Viruses

Zoonotic viruses represent emerging or re-emerging pathogens that pose significant public health threats worldwide. It is therefore crucial to advance current surveillance mechanisms for these viruses through approaches such as phylogeography. Despite the abundance of zoonotic viral sequence data in publicly available databases such as GenBank, phylogeographic analysis of these viruses is often limited by the lack of adequate geographic metadata. However, many GenBank records include references to articles with more detailed information, and automated systems may help extract this information efficiently and effectively. In this paper, we describe our efforts to determine the proportion of GenBank records with “insufficient” geographic metadata for seven well-studied viruses. We also evaluate the performance of four different Named Entity Recognition (NER) systems for automatically extracting related entities, using a manually created gold standard.
