PaperBLAST: Text Mining Papers for Information about Homologs

ABSTRACT Large-scale genome sequencing has identified millions of protein-coding genes whose function is unknown. Many of these proteins are similar to characterized proteins from other organisms, but much of this information is missing from annotation databases and is hidden in the scientific literature. To make this information accessible, PaperBLAST uses EuropePMC to search the full text of scientific articles for references to genes. PaperBLAST also takes advantage of curated resources (Swiss-Prot, GeneRIF, and EcoCyc) that link protein sequences to scientific articles. PaperBLAST’s database includes over 700,000 scientific articles that mention over 400,000 different proteins. Given a protein of interest, PaperBLAST quickly finds similar proteins that are discussed in the literature and presents snippets of text from relevant articles or from the curators. PaperBLAST is available at http://papers.genomics.lbl.gov/.

IMPORTANCE With the recent explosion of genome sequencing data, there are now millions of uncharacterized proteins. If a scientist becomes interested in one of these proteins, it can be very difficult to find information as to its likely function. Often a protein whose sequence is similar, and which is likely to have a similar function, has been studied already, but this information is not available in any database. To help find articles about similar proteins, PaperBLAST searches the full text of scientific articles for protein identifiers or gene identifiers, and it links these articles to protein sequences. Then, given a protein of interest, it can quickly find similar proteins in its database by using standard software (BLAST), and it can show snippets of text from relevant papers. We hope that PaperBLAST will make it easier for biologists to predict proteins’ functions.
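
The linking step described above (finding identifier mentions in article text, then looking up papers for the homologs of a query protein) can be illustrated with a minimal Python sketch. It is not PaperBLAST's actual implementation: the function names, the snippet width, and the placeholder identifiers (`ABC_0001`, `XYZ_0002`) and article text are all hypothetical, and the homolog list is assumed to come from a separate BLAST search that is not shown here.

```python
import re
from collections import defaultdict


def find_snippets(text, identifier, width=40):
    """Return short text windows around each whole-word mention of identifier."""
    pattern = re.compile(r"\b" + re.escape(identifier) + r"\b")
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        snippets.append(text[start:end].strip())
    return snippets


def build_index(articles, identifiers):
    """Map each identifier to the (article_id, snippet) pairs that mention it.

    `articles` is a dict of article_id -> full text (hypothetical input format).
    Identifiers that are never mentioned get no entry in the index.
    """
    index = defaultdict(list)
    for article_id, text in articles.items():
        for identifier in identifiers:
            for snippet in find_snippets(text, identifier):
                index[identifier].append((article_id, snippet))
    return index


def papers_for_homologs(index, homolog_ids):
    """Given identifiers of similar proteins (assumed to come from a BLAST
    search run elsewhere), return the indexed snippets for each one."""
    return {h: index[h] for h in homolog_ids if h in index}


if __name__ == "__main__":
    # Made-up article text and identifiers, for illustration only.
    toy_articles = {
        "article1": "We show that ABC_0001 is required for growth on carnitine.",
        "article2": "This text mentions ABC_00012, a different identifier.",
    }
    toy_index = build_index(toy_articles, ["ABC_0001", "XYZ_0002"])
    for protein, hits in papers_for_homologs(toy_index, ["ABC_0001"]).items():
        for article_id, snippet in hits:
            print(protein, article_id, snippet)
```

Matching on word boundaries keeps `ABC_0001` from matching inside `ABC_00012`. A real system would also need to resolve ambiguous identifiers and handle far larger text collections.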
