Complete fold annotation of the human proteome using a novel structural feature space

Recognition of a protein's structural fold is the starting point for many structure prediction tools and for protein function inference. Because fold prediction is computationally demanding and novel folds are difficult to recognize, the majority of proteins have not been annotated with a fold classification. Here we describe a new machine learning approach, based on a novel feature space, that can be used for accurate recognition of all 1,221 currently known folds and for inference of unknown novel folds. We show that our method achieves better than 94% accuracy even when many folds have only one training example. We demonstrate the utility of this method by predicting the folds of 34,330 human protein domains and showing that these predictions can yield useful insights into potential biological function, such as prediction of RNA-binding ability. Our method can be applied to de novo fold prediction of entire proteomes and can identify candidate novel fold families.
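
The abstract does not specify the classifier or the feature space, so the following is only an illustrative sketch of how fold recognition could work when a fold has a single training example. It assumes a nearest-neighbor rule over fixed-length feature vectors with a distance cutoff for flagging candidate novel folds. The function names, the Euclidean distance, the `novel_threshold` parameter, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_fold(query, training, novel_threshold):
    """Assign the fold of the nearest training example.

    training: list of (fold_label, feature_vector) pairs; one example
              per fold is enough for this rule to produce a prediction.
    novel_threshold: hypothetical cutoff; if the nearest example is
              farther than this, the query is flagged as a candidate
              novel fold by returning None as the label.
    Returns (label_or_None, distance_to_nearest_example).
    """
    best_label, best_dist = None, math.inf
    for label, vec in training:
        d = euclidean(query, vec)
        if d < best_dist:
            best_label, best_dist = label, d
    if best_dist > novel_threshold:
        return None, best_dist
    return best_label, best_dist


# Toy data: two "folds", each with a single training example.
training = [("fold_A", [0.0, 0.0]), ("fold_B", [3.0, 4.0])]

known = predict_fold([0.0, 1.0], training, novel_threshold=2.0)
novel = predict_fold([10.0, 10.0], training, novel_threshold=2.0)
print(known)     # nearest example is fold_A
print(novel[0])  # None: no known fold lies within the cutoff
```

A nearest-neighbor rule is used here only because it needs no more than one example per class; the method described in the abstract may differ substantially.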
