Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs

Background
Traditionally, the native structure of a protein is believed to correspond to a global minimum of its free energy. However, as the number of known tertiary (3D) protein structures has grown, researchers have discovered that some proteins alter their structures in response to changes in their surroundings or with the help of other proteins or ligands. Such structural shifts play a crucial role in protein function. We therefore propose a machine learning method for the prediction of the flexible/rigid regions of proteins (referred to as FlexRP); the method is based on a novel sequence representation and feature selection. Knowledge of the flexible/rigid regions may provide insights into the protein folding process and 3D structure prediction.

Results
The flexible/rigid regions were defined using a dataset of protein sequences that have multiple experimental structures; this dataset was previously used to study the structural conservation of proteins. Sequences drawn from the dataset were represented using feature sets proposed in prior research, such as PSI-BLAST profiles, composition vector, and binary sequence encoding, and using a newly proposed representation based on frequencies of k-spaced amino acid pairs. These representations were processed by feature selection to reduce their dimensionality. The proposed method was compared with several machine learning methods for the prediction of flexible/rigid regions and with two recently proposed methods for the prediction of conformational changes and unstructured regions. The FlexRP method, which applies Logistic Regression to the collocation-based representation with 95 features, obtained 79.5% accuracy. The two runner-up methods, which apply Support Vector Machines (SVM) and Naïve Bayes classifiers to the same sequence representation, obtained 79.2% and 78.4% accuracy, respectively. The remaining methods achieved accuracies below 70%. Finally, the Naïve Bayes method is shown to provide the highest sensitivity for the prediction of flexible regions, while FlexRP and SVM give the highest sensitivity for rigid regions.

Conclusion
A new sequence representation that uses k-spaced amino acid pairs is shown to be the most efficient in the prediction of the flexible/rigid regions of protein sequences. The proposed FlexRP method provides the highest prediction accuracy, about 80%. The experimental tests show that the FlexRP and SVM methods achieve high overall accuracy and the highest sensitivity for rigid regions, while the best predictions for flexible regions are obtained with the Naïve Bayes method.
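The representation based on frequencies of k-spaced amino acid pairs can be illustrated with a short sketch. The abstract does not specify the range of k, the segment length, or the normalization, so the function name `k_spaced_pair_frequencies`, the default `max_k=4`, and the choice to divide each count by the number of valid pairs at that spacing are all assumptions made for illustration, not the authors' implementation. Here a pair is "k-spaced" when exactly k residues separate its two members, so k = 0 corresponds to adjacent residues.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def k_spaced_pair_frequencies(segment, max_k=4):
    """Return a dict mapping (first, second, k) to the pair's frequency.

    Hypothetical helper: max_k and the per-spacing normalization are
    assumptions, since the abstract does not state them.
    """
    features = {}
    for k in range(max_k + 1):
        counts = {pair: 0 for pair in product(AMINO_ACIDS, repeat=2)}
        n_pairs = 0
        # Residue i is paired with residue i + k + 1 (k residues in between).
        for i in range(len(segment) - k - 1):
            pair = (segment[i], segment[i + k + 1])
            if pair in counts:  # skip pairs with non-standard residues
                counts[pair] += 1
                n_pairs += 1
        for (first, second), count in counts.items():
            # Frequency relative to all valid pairs at this spacing.
            features[(first, second, k)] = count / n_pairs if n_pairs else 0.0
    return features


if __name__ == "__main__":
    feats = k_spaced_pair_frequencies("ACAC", max_k=1)
    print(len(feats))             # 400 pairs for each of 2 spacings
    print(feats[("A", "C", 0)])   # adjacent A-C pairs
    print(feats[("A", "A", 1)])   # A-A pairs with one residue in between
```

With `max_k=4` this yields 2,000 features per segment, which suggests why a feature selection step that reduces the representation to a small subset (95 features in FlexRP) would be applied before training a classifier such as Logistic Regression.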
