DeepGO: predicting protein functions from sequence and interactions using a deep ontology-aware classifier

Abstract

Motivation: A large number of protein sequences are becoming available through the application of novel high-throughput sequencing technologies. Experimental functional characterization of these proteins is time-consuming and expensive, and is often done rigorously only for a few selected model organisms. Computational function prediction approaches have been suggested to fill this gap. The functions of proteins are classified using the Gene Ontology (GO), which contains over 40 000 classes. Additionally, proteins have multiple functions, making function prediction a large-scale, multi-class, multi-label problem.

Results: We have developed a novel method to predict protein function from sequence. We use deep learning to learn features from protein sequences as well as from a cross-species protein–protein interaction network. Our approach specifically outputs information in the structure of the GO and utilizes the dependencies between GO classes as background information to construct a deep learning model. We evaluate our method using the standards established by the Computational Assessment of Function Annotation (CAFA) and demonstrate a significant improvement over baseline methods such as BLAST, in particular for predicting cellular locations.

Availability and implementation: Web server: http://deepgo.bio2vec.net. Source code: https://github.com/bio-ontology-research-group/deepgo

Supplementary information: Supplementary data are available at Bioinformatics online.
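The abstract frames function prediction as a multi-label problem over classes that depend on one another through the GO hierarchy. As a rough illustration of what that means for the training targets, the sketch below propagates a single annotation to its ancestors and encodes the result as a multi-hot vector. It is not taken from the DeepGO source code: the four-term is_a fragment and the helper names (`ancestors_inclusive`, `propagate`, `multi_hot`) are assumptions made for this example only.

```python
from typing import Dict, List, Set

# Toy is_a fragment of the GO molecular_function branch (child -> parents).
# Assumption: a hand-picked subset for illustration, not the full ontology.
PARENTS: Dict[str, List[str]] = {
    "GO:0003674": [],              # molecular_function (root)
    "GO:0005488": ["GO:0003674"],  # binding
    "GO:0005515": ["GO:0005488"],  # protein binding
    "GO:0003824": ["GO:0003674"],  # catalytic activity
}


def ancestors_inclusive(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(annotations: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Close a set of annotations under the hierarchy: a protein annotated
    with a class is also treated as annotated with every ancestor class."""
    closed: Set[str] = set()
    for term in annotations:
        closed |= ancestors_inclusive(term, parents)
    return closed


def multi_hot(terms: Set[str], classes: List[str]) -> List[int]:
    """Encode a set of classes as a 0/1 vector over a fixed class order."""
    return [1 if c in terms else 0 for c in classes]


CLASSES = sorted(PARENTS)                    # fixed output order
labels = propagate({"GO:0005515"}, PARENTS)  # one direct annotation
vector = multi_hot(labels, CLASSES)
print(CLASSES)
print(vector)
```

With the real ontology the vector would have one entry per selected GO class, which is why a hierarchy-consistent target such as this one becomes a large-scale multi-label output.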
