Predicting Human Protein Function with Multi-task Deep Neural Networks

Machine learning methods for protein function prediction are urgently needed, especially now that a substantial fraction of known sequences remains unannotated despite the extensive use of functional assignments based on sequence similarity. One major bottleneck supervised learning faces in protein function prediction is the structured, multi-label nature of the problem, because biological roles are represented by lists of terms from hierarchically organised controlled vocabularies such as the Gene Ontology. In this work, we build on recent developments in the area of deep learning and investigate the usefulness of multi-task deep neural networks (MTDNN), which consist of upstream shared layers upon which are stacked in parallel as many independent modules (additional hidden layers with their own output units) as the number of output GO terms (the tasks). MTDNN learns individual tasks partially using shared representations and partially from task-specific characteristics. When no close homologues with experimentally validated functions can be identified, MTDNN gives more accurate predictions than baseline methods based on annotation frequencies in public databases or homology transfers. More importantly, the results show that MTDNN binary classification accuracy is higher than alternative machine learning-based methods that do not exploit commonalities and differences among prediction tasks. Interestingly, compared with a single-task predictor, the performance improvement is not linearly correlated with the number of tasks in MTDNN, but medium size models provide more improvement in our case. One of advantages of MTDNN is that given a set of features, there is no requirement for MTDNN to have a bootstrap feature selection procedure as what traditional machine learning algorithms do. Overall, the results indicate that the proposed MTDNN algorithm improves the performance of protein function prediction. On the other hand, there is still large room for deep learning techniques to further enhance prediction ability.

[1]  Tapio Salakoski,et al.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy , 2016, Genome Biology.

[2]  Yanjun Qi,et al.  A Unified Multitask Architecture for Predicting Local Protein Properties , 2012, PloS one.

[3]  Keun Ho Ryu,et al.  Identification of protein functions using a machine-learning approach based on sequence-derived properties , 2009, Proteome Science.

[4]  Nikhil Ketkar,et al.  Deep Learning with Python , 2017 .

[5]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[6]  David Warde-Farley,et al.  GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function , 2008, Genome Biology.

[7]  Damiano Piovesan,et al.  FFPred 2.0: Improved Homology-Independent Prediction of Gene Ontology Terms for Eukaryotic Protein Sequences , 2013, PloS one.

[8]  Christine A. Orengo,et al.  FFPred: an integrated feature-based function prediction server for vertebrate proteomes , 2008, Nucleic Acids Res..

[9]  Daniel W. A. Buchan,et al.  A large-scale evaluation of computational protein function prediction , 2013, Nature Methods.

[10]  Suzanna Lewis,et al.  Phylogenetic-based propagation of functional annotations within the Gene Ontology consortium , 2011, Briefings Bioinform..

[11]  Byunghan Lee,et al.  Deep learning in bioinformatics , 2016, Briefings Bioinform..

[12]  Yaoqi Zhou,et al.  Improving protein disorder prediction by deep bidirectional long short‐term memory recurrent neural networks , 2016, Bioinform..

[13]  Prudence Mutowo-Meullenet,et al.  The GOA database: Gene Ontology annotation updates for 2015 , 2014, Nucleic Acids Res..

[14]  Qingguo Wang,et al.  Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives , 2013, BMC Bioinformatics.

[15]  David T Jones,et al.  Computational Methods for Annotation Transfers from Sequence. , 2016, Methods in molecular biology.

[16]  Ting Chen,et al.  Integrative approaches for predicting protein function and prioritizing genes for complex phenotypes using protein interaction networks , 2014, Briefings Bioinform..

[17]  Michael Y. Galperin,et al.  Comparative Genomics Approaches to Identifying Functionally Related Genes , 2014, AlCoB.

[18]  Michael J. E. Sternberg,et al.  CombFunc: predicting protein function using heterogeneous data sources , 2012, Nucleic Acids Res..

[19]  Günter Klambauer,et al.  DeepTox: Toxicity Prediction using Deep Learning , 2016, Front. Environ. Sci..

[20]  Ole Winther,et al.  Protein Secondary Structure Prediction with Long Short Term Memory Networks , 2014, ArXiv.

[21]  Olga G. Troyanskaya,et al.  Putting microarrays in a context: Integrated analysis of diverse biological data , 2005, Briefings Bioinform..

[22]  P. Radivojac,et al.  Analysis of protein function and its prediction from amino acid sequence , 2011, Proteins.

[23]  Hannah Currant,et al.  FFPred 3: feature-based function prediction for all Gene Ontology domains , 2016, Scientific Reports.

[24]  Johannes Söding,et al.  kClust: fast and sensitive clustering of large protein sequence databases , 2013, BMC Bioinformatics.

[25]  Silvio C. E. Tosatto,et al.  INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity , 2015, Nucleic Acids Res..

[26]  Zhen Li,et al.  Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model , 2016, bioRxiv.

[27]  Daniel W. A. Buchan,et al.  Protein function prediction by massive integration of evolutionary analyses and multiple data sources , 2013, BMC Bioinformatics.

[28]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[29]  María Martín,et al.  The Gene Ontology: enhancements for 2011 , 2011, Nucleic Acids Res..

[30]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[31]  The Gene Ontology Consortium,et al.  Expansion of the Gene Ontology knowledgebase and resources , 2016, Nucleic Acids Res..

[32]  Patricia C. Babbitt,et al.  Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies , 2009, PLoS Comput. Biol..

[33]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[34]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[35]  Nikhil Ketkar Introduction to Theano , 2017 .

[36]  Vijay S. Pande,et al.  Massively Multitask Networks for Drug Discovery , 2015, ArXiv.

[37]  David D. Cox,et al.  Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures , 2013, ICML.

[38]  Matteo Pellegrini,et al.  Using phylogenetic profiles to predict functional relationships. , 2012, Methods in molecular biology.