RBSURFpred: Modeling protein accessible surface area in real and binary space using regularized and optimized regression.

The accessible surface area (ASA) of a protein residue is an effective feature for protein structure prediction, binding region identification, fold recognition and related problems. Improving ASA prediction through more informative feature variables remains a challenging but tractable task, especially in the field of machine learning. Among existing ASA predictors, REGAd3p is highly accurate and is based on regularized exact regression with a polynomial kernel of degree 3. In this work, we present a new predictor, RBSURFpred, which extends REGAd3p along several dimensions by incorporating 58 physicochemical, evolutionary and structural properties into 9-tuple peptides via Chou's general PseAAC. This allowed us to obtain higher accuracies in predicting both real-valued and binary ASA. We have compared RBSURFpred with state-of-the-art predictors, such as REGAd3p and SPIDER2, for both real-valued and binary predictions. We have also carried out a rigorous analysis of the performance of RBSURFpred across different amino acids and their properties, as well as on biologically relevant case studies. The performance of RBSURFpred establishes it as a useful tool for the community.
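
The ideas named in the abstract (9-residue windows, a degree-3 polynomial kernel, regularized regression, and a binary label derived from a real value) can be illustrated with a small sketch. This is not the RBSURFpred or REGAd3p implementation: kernel ridge regression is swapped in here as one simple closed-form regularized regression, the features and ASA targets are synthetic random numbers, and the function names, the regularization weight `lam` and the 0.25 binary cutoff are all assumptions made for illustration.

```python
import numpy as np

def window_features(per_residue, width=9):
    """Concatenate each residue's features with those of its neighbours.

    per_residue has shape (L, F). Positions beyond either terminus are
    zero-padded, so the result has shape (L, width * F).
    """
    half = width // 2
    length, n_feat = per_residue.shape
    pad = np.zeros((half, n_feat))
    padded = np.vstack([pad, per_residue, pad])
    return np.stack([padded[i:i + width].ravel() for i in range(length)])

def poly_kernel(a, b, degree=3, coef0=1.0):
    """Polynomial kernel; the dot product is scaled by the feature count."""
    return (a @ b.T / a.shape[1] + coef0) ** degree

def fit_kernel_ridge(x, y, lam=1.0):
    """Closed-form kernel ridge solution: alpha = (K + lam * I)^-1 y."""
    k = poly_kernel(x, x)
    return np.linalg.solve(k + lam * np.eye(len(x)), y)

def predict(x_train, alpha, x_new):
    """Real-valued prediction for new windows from the fitted weights."""
    return poly_kernel(x_new, x_train) @ alpha

# Synthetic stand-ins (NOT real protein data): 40 residues, 58 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 58))
relative_asa = rng.uniform(0.0, 1.0, size=40)

x = window_features(features)            # one 9-residue window per residue
alpha = fit_kernel_ridge(x, relative_asa)
real_pred = predict(x, alpha, x)         # real-valued prediction
binary_pred = real_pred >= 0.25          # assumed cutoff: exposed vs. buried
```

With real data, the per-residue feature matrix would come from the physicochemical, evolutionary and structural properties described above, and both `lam` and the cutoff would need to be chosen by validation rather than fixed as they are here.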
