Prodepth: Predict Residue Depth by Support Vector Regression Approach from Protein Sequences Only

Residue depth (RD) is a solvent exposure measure that complements conventional accessible surface area (ASA) by describing the extent to which a residue is buried within the protein structure. Previous studies have established that RD correlates with several protein properties, such as protein stability, residue conservation and amino acid type. Accurate prediction of RD has many potentially important applications in structural bioinformatics, for example, facilitating the identification from sequence information of functionally important residues, residues in the folding nucleus, or enzyme active sites. In this work, we introduce an efficient approach that uses support vector regression to quantify the relationship between RD and protein sequence. We systematically investigated eight different sequence encoding schemes, covering both local and global sequence characteristics, and examined their respective prediction performances. For an objective evaluation of our approach, we used 5-fold cross-validation to assess prediction accuracy; the best overall performance, obtained after incorporating the relevant multiple sequence features, was a correlation coefficient (CC) of 0.71 between the observed and predicted RD values and a root mean square error (RMSE) of 1.74. The results suggest that residue depth can be reliably predicted solely from protein primary sequences: local sequence environments are the major determinants, while global sequence features influence prediction performance only marginally. We highlight two examples for comparison to illustrate the applicability of this approach. We also discuss the potential implications of this structural parameter for protein structure prediction and homology modeling. This method might prove to be a powerful tool for sequence analysis.
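As an illustration of the two evaluation measures named above, the following is a minimal sketch of how a Pearson correlation coefficient and an RMSE between observed and predicted RD values might be computed, together with a simple 5-fold partition. The function names (`pearson_cc`, `rmse`, `five_fold_indices`) and the random assignment of items to folds are assumptions made for this example; they are not taken from the Prodepth implementation, and the abstract does not state how the folds were constructed.

```python
import math
import random


def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def five_fold_indices(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint test folds.

    Assumption: items are assigned to folds at random; the original study
    may have partitioned its data differently.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]
```

In a cross-validation run, each fold would serve once as the test set while a regressor is trained on the remaining four, and CC and RMSE would then be computed on the held-out predictions.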
