Finding the needle in a haystack: educing native folds from ambiguous ab initio protein structure predictions

Current ab initio structure-prediction methods can sometimes generate families of folds, one of which is native, but cannot single out the native one because of imperfections in the folding potentials and an inability to explore the conformational space thoroughly. To address this issue, we describe a method for detecting statistically significant folds in a pool of predicted structures. Our approach consists of clustering and averaging the structures into representative fold families. Using a metric derived from the root-mean-square distance (RMSD) that is less sensitive to protein size, we determine whether the simulated structures are clustered relative to a group of random structures. The clustering method searches for cluster centers and iteratively calculates the clusters and their respective centroids. The centroid interresidue distances are adjusted by minimizing a potential constructed from the corresponding average distances of the cluster structures. Application of this method to selected proteins shows that it can detect the fold family that is closest to native, along with several other misfolded families. We also describe a method to obtain substructures, which is useful when the folding simulation fails to predict the total topology but produces subelements common to the structures. We have created a web server that clusters user-submitted structures, which can be found at http://bioinformatics.danforthcenter.org/services/scar. © 2001 John Wiley & Sons, Inc. J Comput Chem 22: 339–353, 2001
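The iterative step described above (assign structures to cluster centers, then recompute each centroid from the average interresidue distances of its members) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes structures are given as arrays of C-alpha coordinates, uses a plain distance-matrix RMSD in place of the size-insensitive metric mentioned in the abstract, takes the initial centers as user-chosen seeds, and omits the potential minimization that refines the centroids. The function names `distance_matrix`, `drmsd`, and `cluster_structures` are hypothetical.

```python
import numpy as np

def distance_matrix(coords):
    """Interresidue distance matrix for an (n_residues, 3) coordinate array."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def drmsd(d1, d2):
    """RMS difference between two distance matrices over residue pairs i < j.

    Assumption: a stand-in for the paper's RMSD-derived metric, chosen
    because it needs no structural superposition.
    """
    iu = np.triu_indices(d1.shape[0], k=1)
    return np.sqrt(np.mean((d1[iu] - d2[iu]) ** 2))

def cluster_structures(structures, seeds, n_iter=20):
    """Alternate between assignment and centroid averaging.

    structures: list of (n_residues, 3) arrays, all of the same length.
    seeds: indices of the structures used as initial cluster centers
           (an assumption; the paper searches for centers itself).
    Returns (labels, centroids), where each centroid is an averaged
    interresidue distance matrix.
    """
    dms = np.array([distance_matrix(s) for s in structures])
    centroids = dms[list(seeds)].copy()
    labels = None
    for _ in range(n_iter):
        # Distance from every structure to every current centroid.
        dist = np.array([[drmsd(dm, c) for c in centroids] for dm in dms])
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for k in range(len(centroids)):
            members = dms[labels == k]
            if len(members) > 0:
                # Centroid distances are the member-averaged distances.
                centroids[k] = members.mean(axis=0)
    return labels, centroids
```

Calling `cluster_structures(structures, seeds=(0, 5))` on a pool of ten predicted structures returns one label per structure and one averaged distance matrix per cluster. An averaged distance matrix is generally not realizable as a three-dimensional structure, which is presumably why the method adjusts the centroid distances by minimizing a potential afterwards.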
