Mining locus tags in PubMed Central to improve microbial gene annotation

Background

The scientific literature contains millions of microbial gene identifiers within full text and tables, but these annotations are rarely incorporated into public sequence databases. We propose using the Open Access (OA) subset of PubMed Central (PMC) as a gene annotation database and have developed an R package, pmcXML, to automatically mine and extract locus tags from full text, tables and supplements.

Results

We mined locus tags for ten microbial genomes from 1835 OA publications and extracted tags mentioned in 30,891 sentences in the main text and 20,489 rows in tables. We identified locus tag pairs marking the start and end of a region, such as an operon or genomic island, and expanded these ranges to add another 13,043 tags. For comparison, we also searched for Burkholderia pseudomallei K96243 locus tags in supplementary tables and in publications outside the OA subset. In total, 168 publications contained 48,470 locus tags; 83% of mentions came from supplementary materials and 9% from publications outside the OA subset.

Conclusions

B. pseudomallei locus tags within the full text and tables of OA publications represent only a small fraction of the total mentions in the literature. For microbial genomes with very few functionally characterized proteins, supplementary tables and ranges such as genomic islands account for the majority of locus tags. Significantly, the functions in the R package provide access to additional resources in the OA subset that are not currently indexed or returned by searching PMC.
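The range expansion described in the Results can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the pmcXML implementation, which is an R package whose functions are not shown here. The sketch assumes that B. pseudomallei K96243 locus tags take the form BPSL or BPSS followed by four digits, and that tags between the start and end of a range are numbered consecutively; a more careful version would look up the intervening tags in the genome's ordered gene list. The names find_tags and expand_ranges are hypothetical.

```python
import re

# Assumed tag format: BPSL (chromosome 1) or BPSS (chromosome 2) plus four digits.
TAG = re.compile(r"\b(BPS[LS])(\d{4})\b")

# A start/end pair such as "BPSS1492-BPSS1511" or "BPSS1492 to 1511".
RANGE = re.compile(r"\b(BPS[LS])(\d{4})\s*(?:-|–|to)\s*(BPS[LS])?(\d{4})\b")


def find_tags(text):
    """Return the sorted, de-duplicated locus tags mentioned directly in text."""
    return sorted({prefix + digits for prefix, digits in TAG.findall(text)})


def expand_ranges(text):
    """Return every tag implied by start/end pairs in text, endpoints included.

    Assumes consecutive numbering between the endpoints (hypothetical
    simplification); pairs that span two prefixes or run backwards are skipped.
    """
    expanded = set()
    for prefix, start, end_prefix, end in RANGE.findall(text):
        if end_prefix and end_prefix != prefix:
            continue  # endpoints on different replicons: not a usable range
        low, high = int(start), int(end)
        if high < low:
            continue  # reversed pair: leave it for manual review
        expanded.update(f"{prefix}{n:04d}" for n in range(low, high + 1))
    return sorted(expanded)


if __name__ == "__main__":
    # Hypothetical sentence, not taken from any publication.
    sentence = "The cluster BPSS1492-BPSS1511 and the gene BPSL0001 were mentioned."
    print(find_tags(sentence))
    print(len(expand_ranges(sentence)))
```

In this sketch the direct search returns only the three tags written out in the sentence, while the expansion step contributes the tags lying between the two range endpoints. This mirrors how expanding ranges adds tags that never appear verbatim in the text.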
