High sequence identity between two proteins (e.g. > 60%) is strong evidence for high structural similarity. However, internal shifts in one of the two proteins can sometimes give rise to unexpectedly large structural differences. This, in turn, makes structure predictions unreliable when two such proteins are used in homology modeling. Here, we perform a computational analysis of helix shifts and show that their occurrence can be predicted with statistical learning methods. Our results indicate that helix shifts increase the RMS error by a factor of 2.6 compared to protein pairs without a helix shift. Although helix shifts are rare (1.6% of helices are affected, and a commensurately higher fraction of proteins), they pose a significant problem for reliable structure prediction systems. In this paper, we prototype a new approach to model quality assessment and demonstrate that it can successfully warn against helix shifts. A support vector machine trained on a wide range of sequence and structure properties predicts the occurrence of helix shifts with a sensitivity of 74.2% and a specificity of 83.6%. On an equalized test dataset (equal numbers of shifted and unshifted cases), this corresponds to an accuracy of 78.9%. Projected to the full dataset, it translates to an accuracy of 83.4%. Our analysis shows that helix shift detection is a valuable building block for highly reliable structure prediction systems. Furthermore, the statistical learning approach to helix shift detection that we employ here is orthogonal to well-established model quality assessment methods, which use geometric constraint checking or mean force potentials. A further increase in prediction accuracy is therefore expected from combining these methods.
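The two accuracy figures quoted above follow from the sensitivity and specificity once a class balance is fixed. The short sketch below reproduces that arithmetic. It assumes that the projection to the full dataset weights the two classes by the reported 1.6% helix shift rate; the abstract does not state the weighting explicitly, and the function name `accuracy` is purely illustrative.

```python
def accuracy(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Overall accuracy as the prevalence-weighted mean of sensitivity and specificity.

    prevalence is the fraction of positive cases (here: helices with a shift).
    """
    return prevalence * sensitivity + (1.0 - prevalence) * specificity


# Values reported in the abstract.
SENSITIVITY = 0.742
SPECIFICITY = 0.836

# Equalized test set: shifted and unshifted helices are equally frequent.
balanced = accuracy(SENSITIVITY, SPECIFICITY, prevalence=0.5)

# Full dataset: ASSUMPTION that positives occur at the reported 1.6% rate.
projected = accuracy(SENSITIVITY, SPECIFICITY, prevalence=0.016)

print(f"equalized accuracy: {balanced:.1%}")
print(f"projected accuracy: {projected:.1%}")
```

Because unshifted helices dominate the full dataset, the projected accuracy is driven almost entirely by the specificity, which is why it exceeds the accuracy on the equalized set.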