Optimizing Long Intrinsic Disorder Predictors with Protein Evolutionary Information

Proteins that exist as ensembles of structures, termed intrinsically disordered, have been shown to carry out a wide variety of biological functions and to be common in nature. Here we focus on improving sequence-based prediction of long (>30 amino acid residues) regions lacking specific 3-D structure by means of four new neural-network-based Predictors Of Natural Disordered Regions (PONDRs): VL3, VL3H, VL3P, and VL3E. PONDR VL3 used several features from the previously introduced PONDR VL2, but benefited from optimized predictor models and a slightly larger set of disordered proteins (152 vs. 145) from which mislabeling errors found in the smaller set had been removed. PONDR VL3H used homologues of the disordered proteins in the training stage, while PONDR VL3P used attributes derived from sequence profiles obtained by PSI-BLAST searches. Accuracy was measured as the average of the accuracies on disordered and ordered protein regions. By this measure, the 30-fold cross-validation accuracies of VL3, VL3H, and VL3P were 83.6 +/- 1.4%, 85.3 +/- 1.4%, and 85.2 +/- 1.5%, respectively. PONDR VL3E, which combines VL3H and VL3P, achieved an accuracy of 86.7 +/- 1.4%. This is a significant improvement over our previous PONDRs VLXT (71.6 +/- 1.3%) and VL2 (80.9 +/- 1.4%). The new disorder predictors and the corresponding datasets are freely accessible through the web server at http://www.ist.temple.edu/disprot.
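
The accuracy measure described above, the average of the accuracies on disordered and ordered regions, can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `balanced_accuracy`, the per-residue 0/1 label encoding (1 = disordered, 0 = ordered), and the toy data are all assumptions made for illustration.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Average of the accuracy on disordered (1) and ordered (0) residues.

    The 0/1 per-residue encoding is an assumption for this sketch.
    """
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for truth, pred in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if total[0] == 0 or total[1] == 0:
        raise ValueError("both ordered and disordered residues are required")
    acc_disordered = correct[1] / total[1]
    acc_ordered = correct[0] / total[0]
    return (acc_disordered + acc_ordered) / 2


# Toy data: 2 disordered and 8 ordered residues, with a trivial
# predictor that labels every residue as ordered.
truth = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
preds = [0] * 10

plain = sum(t == p for t, p in zip(truth, preds)) / len(truth)
print(plain)                            # 0.8
print(balanced_accuracy(truth, preds))  # 0.5
```

On this imbalanced toy set the trivial predictor reaches 80% plain accuracy but only 50% under the averaged measure, which is why averaging per-class accuracies is a natural choice when ordered residues far outnumber disordered ones.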
