Large-scale prediction of long disordered regions in proteins using random forests

Background: Many proteins contain disordered regions that lack a fixed three-dimensional (3D) structure under physiological conditions but have important biological functions. Predicting disordered regions in protein sequences is important for understanding protein function and for high-throughput determination of protein structures. Machine learning techniques, including neural networks and support vector machines, have been widely used for such predictions. Predictors designed for long disordered regions are usually less successful at predicting short disordered regions. Combining the prediction of short and long disordered regions dramatically increases the complexity of the prediction algorithm and makes the predictor unsuitable for large-scale applications. Efficient batch prediction of long disordered regions alone is of greater interest in large-scale proteome studies.

Results: This paper proposes a new algorithm, IUPforest-L, for predicting long disordered regions using the random forest learning model. IUPforest-L is based on the Moreau-Broto auto-correlation function of amino acid indices (AAIs) and other physicochemical features of the primary sequences. In 10-fold cross-validation tests, IUPforest-L achieves an area under the receiver operating characteristic (ROC) curve of 89.5%. Compared with existing disorder predictors, IUPforest-L has high prediction accuracy and is efficient for predicting long disordered regions in large-scale proteomes.

Conclusion: A random forest model based on the auto-correlation functions of the AAIs within a protein fragment, together with other physicochemical features, can effectively detect long disordered regions in proteins. A new predictor, IUPforest-L, was developed for batch prediction of long disordered regions in proteins, and the server can be accessed at http://dmg.cs.rmit.edu.au/IUPforest/IUPforest-L.php
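
To make the feature construction more concrete, the following is a minimal sketch of a Moreau-Broto auto-correlation computed over one protein fragment. It is an illustration under stated assumptions, not the IUPforest-L implementation: the function name `moreau_broto`, the choice of the Kyte-Doolittle hydropathy scale as the example amino acid index, the normalization by (N - d), and the lag range are all assumptions, since the abstract does not specify them.

```python
# Kyte-Doolittle hydropathy values, used here only as an example amino acid
# index (assumption: the abstract does not say which indices are used).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def moreau_broto(fragment, index, max_lag):
    """Return [AC(1), ..., AC(max_lag)] for one sequence fragment.

    AC(d) = 1 / (N - d) * sum_{i=1}^{N-d} P_i * P_{i+d}, where P_i is the
    index value of residue i and N is the fragment length. Dividing by
    (N - d) is one common convention and is an assumption here.
    """
    values = [index[aa] for aa in fragment]  # KeyError on non-standard residues
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the fragment length")
    features = []
    for d in range(1, max_lag + 1):
        # Sum the products of index values for residues separated by lag d.
        total = sum(values[i] * values[i + d] for i in range(n - d))
        features.append(total / (n - d))
    return features


if __name__ == "__main__":
    # Hypothetical fragment, chosen only to show the call.
    print(moreau_broto("MKSLEEKRPA", KYTE_DOOLITTLE, max_lag=3))
```

Each lag yields one number per amino acid index, so a fragment described by several indices and several lags becomes a fixed-length feature vector, which is the kind of input a random forest classifier expects.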
