Annotating genes and genomes with DNA sequences extracted from biomedical articles

Motivation: Increasing rates of publication and DNA sequencing make it harder than ever to find the articles relevant to a particular gene or genomic region. Existing text-mining approaches focus on finding gene names or identifiers in English text. These names and identifiers are often not unique, and they do not identify the exact genomic location of a study.

Results: Here, we report the results of a novel text-mining approach that extracts DNA sequences from biomedical articles and automatically maps them to genomic databases. We find that ∼20% of open access articles in PubMed Central (PMC) have extractable DNA sequences that can be accurately mapped to the correct gene (91%) and genome (96%). Using data extracted by text2genome from more than 150 000 PMC articles, we illustrate the utility of the approach for the interpretation of ChIP-seq data and the design of quantitative reverse transcriptase (RT)-PCR experiments.

Conclusion: Our approach links articles to genes and organisms without relying on gene names or identifiers. It also produces genome annotation tracks of the biomedical literature, thereby allowing researchers to use the power of modern genome browsers to access and analyze publications in the context of genomic data.

Availability and implementation: Source code is available under a BSD license from http://sourceforge.net/projects/text2genome/ and results can be browsed and downloaded at http://text2genome.org.

Contact: maximilianh@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.
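To make the extraction step concrete, the following is a minimal sketch of how candidate DNA sequences might be pulled out of article text. It is not the text2genome implementation: the regular expression, the 18-nucleotide threshold (`MIN_LEN`) and the function name `extract_dna_candidates` are all assumptions chosen for illustration, and the mapping of candidates to genomic databases is not shown.

```python
import re

# Hypothetical threshold: runs shorter than this are too likely to be
# ordinary words or abbreviations. The real tool may use a different rule.
MIN_LEN = 18

# A run of A/C/G/T letters that is not embedded in a longer alphabetic word.
CANDIDATE = re.compile(
    r"(?<![A-Za-z])[ACGTacgt]{%d,}(?![A-Za-z])" % MIN_LEN
)


def extract_dna_candidates(text):
    """Return upper-cased nucleotide runs found in text, in order of appearance."""
    return [match.group(0).upper() for match in CANDIDATE.finditer(text)]


if __name__ == "__main__":
    sentence = "The forward primer was 5'-GGTCTCAAGTCAGTGTACAGG-3' in all assays."
    for candidate in extract_dna_candidates(sentence):
        print(candidate)
```

Each candidate returned by such a filter would still need to be aligned against genome or transcript databases before it could be linked to a gene or organism, and a plain regular expression like this one will miss sequences that are split across lines or interrupted by spaces.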
