A sparse autoencoder-based deep neural network for protein solvent accessibility and contact number prediction

BackgroundDirect prediction of the three-dimensional (3D) structures of proteins from one-dimensional (1D) sequences is a challenging problem. Significant structural characteristics such as solvent accessibility and contact number are essential for deriving restrains in modeling protein folding and protein 3D structure. Thus, accurately predicting these features is a critical step for 3D protein structure building.ResultsIn this study, we present DeepSacon, a computational method that can effectively predict protein solvent accessibility and contact number by using a deep neural network, which is built based on stacked autoencoder and a dropout method. The results demonstrate that our proposed DeepSacon achieves a significant improvement in the prediction quality compared with the state-of-the-art methods. We obtain 0.70 three-state accuracy for solvent accessibility, 0.33 15-state accuracy and 0.74 Pearson Correlation Coefficient (PCC) for the contact number on the 5729 monomeric soluble globular protein dataset. We also evaluate the performance on the CASP11 benchmark dataset, DeepSacon achieves 0.68 three-state accuracy and 0.69 PCC for solvent accessibility and contact number, respectively.ConclusionsWe have shown that DeepSacon can reliably predict solvent accessibility and contact number with stacked sparse autoencoder and a dropout approach.

[1]  Q. Zou,et al.  Recent Progress in Machine Learning-Based Methods for Protein Fold Recognition , 2016, International journal of molecular sciences.

[2]  B. Rost,et al.  Conservation and prediction of solvent accessibility in protein families , 1994, Proteins.

[3]  James G. Lyons,et al.  Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning , 2015, Scientific Reports.

[4]  P. Baldi,et al.  Prediction of coordination number and relative solvent accessibility in proteins , 2002, Proteins.

[5]  Geoffrey E. Hinton,et al.  Learning representations of back-propagation errors , 1986 .

[6]  Kuldip K. Paliwal,et al.  Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto‐encoder deep neural network , 2014, J. Comput. Chem..

[7]  Yaoqi Zhou,et al.  Improving the prediction accuracy of residue solvent accessibility and real‐value backbone torsion angles of proteins by guided‐learning through a two‐layer neural network , 2009, Proteins.

[8]  M. Schroeder,et al.  LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation , 2006, BMC Structural Biology.

[9]  Kuo-Chen Chou,et al.  RSARF: prediction of residue solvent accessibility from protein sequence using random forest method. , 2012, Protein and peptide letters.

[10]  Aleksey A. Porollo,et al.  Accurate prediction of solvent accessibility using neural networks–based regression , 2004, Proteins.

[11]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[12]  Lukasz A. Kurgan,et al.  PFRES: protein fold classification by using evolutionary information and predicted secondary structure , 2007, Bioinform..

[13]  Zhiqiang Ma,et al.  PSNO: Predicting Cysteine S-Nitrosylation Sites by Incorporating Various Sequence-Derived Features into the General Form of Chou’s PseAAC , 2014, International journal of molecular sciences.

[14]  James G. Lyons,et al.  SPIDER2: A Package to Predict Secondary Structure, Accessible Surface Area, and Main-Chain Torsional Angles by Deep Neural Networks. , 2017, Methods in molecular biology.

[15]  J. S. Sodhi,et al.  Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. , 2004, Journal of molecular biology.

[16]  Eran Eyal,et al.  Importance of solvent accessibility and contact surfaces in modeling side‐chain conformations in proteins , 2004, J. Comput. Chem..

[17]  Jihong Guan,et al.  Dynamic epigenetic mode analysis using spatial temporal clustering , 2016, BMC Bioinformatics.

[18]  Zheng Yuan,et al.  Better prediction of protein contact number using a support vector regression analysis of amino acid sequence , 2005, BMC Bioinformatics.

[19]  Jianzhu Ma,et al.  AcconPred: Predicting Solvent Accessibility and Contact Number Simultaneously by a Multitask Learning Framework under the Conditional Neural Fields Model , 2015, BioMed research international.

[20]  K. Nishikawa,et al.  Predicting absolute contact numbers of native protein structure from amino acid sequence , 2004, Proteins.

[21]  Haesun Park,et al.  Prediction of protein relative solvent accessibility with support vector machines and long‐range interaction 3D local descriptor , 2004, Proteins.

[22]  D. Eisenberg,et al.  A method to identify protein sequences that fold into a known three-dimensional structure. , 1991, Science.

[23]  Jingpu Zhang,et al.  KATZLGO: Large-Scale Prediction of LncRNA Functions by Using the KATZ Measure Based on Multiple Networks , 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[24]  W. Kabsch,et al.  Dictionary of protein secondary structure: Pattern recognition of hydrogen‐bonded and geometrical features , 1983, Biopolymers.

[25]  B. Lee,et al.  The interpretation of protein structures: estimation of static accessibility. , 1971, Journal of molecular biology.

[26]  Ying Ju,et al.  Pretata: predicting TATA binding proteins with novel features and dimensionality reduction strategy , 2016, BMC Systems Biology.

[27]  Darby Tien-Hao Chang,et al.  Real value prediction of protein solvent accessibility using enhanced PSSM features , 2008, BMC Bioinformatics.

[28]  Jijun Tang,et al.  PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only , 2017, IEEE Transactions on NanoBioscience.

[29]  A. Sali,et al.  Protein Structure Prediction and Structural Genomics , 2001, Science.

[30]  Alexander Tropsha,et al.  Scoring protein interaction decoys using exposed residues (SPIDER): A novel multibody interaction scoring function based on frequent geometric patterns of interfacial residues , 2012, Proteins.

[31]  Keehyoung Joo,et al.  proteins STRUCTURE O FUNCTION O BIOINFORMATICS SANN: Solvent accessibility prediction of proteins , 2022 .

[32]  Jiangning Song,et al.  Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information , 2006, BMC Bioinformatics.

[33]  Jingpu Zhang,et al.  Integrating Multiple Heterogeneous Networks for Novel LncRNA-Disease Association Inference , 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[34]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[35]  R. Nussinov,et al.  Protein–protein interactions: Structurally conserved residues distinguish between binding sites and exposed protein surfaces , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[36]  Pierre Baldi,et al.  SSpro/ACCpro 5: almost perfect prediction of protein secondary structure and relative solvent accessibility using profiles, machine learning and structural similarity , 2014, Bioinform..

[37]  H Naderi-Manesh,et al.  Prediction of protein surface accessibility with information theory. , 2000, Proteins.

[38]  K. Dill,et al.  The Protein-Folding Problem, 50 Years On , 2012, Science.

[39]  O. Lund,et al.  Prediction of residues in discontinuous B‐cell epitopes using protein 3D structures , 2006, Protein science : a publication of the Protein Society.

[40]  Zhigang Chen,et al.  PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility , 2016, BMC Bioinformatics.

[41]  Piero Fariselli,et al.  Prediction of the Number of Residue Contacts in Proteins , 2000, ISMB.

[42]  D. Thirumalai,et al.  Pair potentials for protein folding: Choice of reference states and sensitivity of predicted native states to variations in the interaction schemes , 2008, Protein science : a publication of the Protein Society.

[43]  Jagath C Rajapakse,et al.  Two‐stage support vector regression approach for predicting accessible surface areas of amino acids , 2006, Proteins.

[44]  Liujuan Cao,et al.  A novel features ranking metric with application to scalable visual and bioinformatics data classification , 2016, Neurocomputing.

[45]  Maxim Totrov,et al.  Accurate and efficient generalized born model based on solvent accessibility: Derivation and application for LogP octanol/water prediction and flexible peptide docking , 2004, J. Comput. Chem..

[46]  Zhiqiang Ma,et al.  Prediction of protein solvent accessibility using PSO-SVR with multiple sequence-derived features and weighted sliding window scheme , 2014, BioData Mining.

[47]  Hui Liu,et al.  Improving compound–protein interaction prediction by building up highly credible negative samples , 2015, Bioinform..

[48]  Andrzej Tomski,et al.  Seqinspector: position-based navigation through the ChIP-seq data landscape to identify gene expression regulators , 2016, BMC Bioinformatics.

[49]  Lilia M. Iakoucheva,et al.  Intrinsic Disorder Is a Common Feature of Hub Proteins from Four Eukaryotic Interactomes , 2006, PLoS Comput. Biol..

[50]  Nitish Srivastava,et al.  Improving neural networks by preventing co-adaptation of feature detectors , 2012, ArXiv.

[51]  Lei Deng,et al.  A computational interactome and functional annotation for the human proteome , 2016, eLife.

[52]  R A Goldstein,et al.  Predicting solvent accessibility: Higher accuracy using Bayesian statistics and optimized residue substitution classes , 1996, Proteins.

[53]  Huaiyu Zhu On Information and Sufficiency , 1997 .

[54]  Guoli Wang,et al.  PISCES: a protein sequence culling server , 2003, Bioinform..

[55]  Andreas Bracher,et al.  Molecular chaperones in protein folding and proteostasis , 2011, Nature.

[56]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[57]  Sean D. Mooney,et al.  Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis , 2005, Briefings Bioinform..

[58]  Xing Gao,et al.  An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information , 2015, IEEE Transactions on NanoBioscience.

[59]  A. Biegert,et al.  HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment , 2011, Nature Methods.

[60]  Jagath C Rajapakse,et al.  Prediction of protein relative solvent accessibility with a two‐stage SVM approach , 2005, Proteins.

[61]  Claus O Wilke,et al.  The Relationship Between Relative Solvent Accessibility and Evolutionary Rate in Protein Evolution , 2011, Genetics.

[62]  Seung-Yeon Kim,et al.  Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method , 2005, Bioinform..

[63]  Shandar Ahmad,et al.  NETASA: neural network based prediction of solvent accessibility , 2002, Bioinform..

[64]  Cheng Cheng,et al.  A permutation-based method to identify loss-of-heterozygosity using paired genotype microarray data , 2008, BMC Bioinformatics.

[65]  H. Dyson,et al.  Intrinsically unstructured proteins and their functions , 2005, Nature Reviews Molecular Cell Biology.

[66]  Aleksey A. Porollo,et al.  Combining prediction of secondary structure and solvent accessibility in proteins , 2005, Proteins.

[67]  Marco Biasini,et al.  SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information , 2014, Nucleic Acids Res..

[68]  M Vendruscolo,et al.  Statistical properties of contact vectors. , 2001, Physical review. E, Statistical, nonlinear, and soft matter physics.