Clustering algorithms for identifying core atom sets and for assessing the precision of protein structure ensembles

An important open question in the field of NMR‐based biomolecular structure determination is how best to characterize the precision of the resulting ensemble of structures. Typically, the RMSD, as minimized in superimposing the ensemble of structures, is the preferred measure of precision. However, the presence of poorly determined atomic coordinates and of multiple “RMSD‐stable domains” (locally well‐defined regions that are not aligned in global superimpositions) complicates RMSD calculations. In this paper, we present a method, based on a novel, structurally defined order parameter, for identifying a set of core atoms to use in determining superimpositions for RMSD calculations. In addition, we present a method for deciding whether to partition that core atom set into “RMSD‐stable domains” and, if so, how to partition it. We demonstrate our algorithm and its application in calculating statistically sound RMSD values by applying it to a set of NMR‐derived structural ensembles, superimposing each RMSD‐stable domain (or the entire core atom set, where appropriate) found in each protein structure under consideration. The ϵ‐value, a parameter calculated by our algorithm using a novel, kurtosis‐based criterion, is a measure of the precision of the superimposition that complements the RMSD. In addition, we compare our algorithm with previously described algorithms for determining core atom sets. The methods presented in this paper for biomolecular structure superimposition are quite general and have applications in many areas of structural bioinformatics and structural biology. Proteins 2005. © 2005 Wiley‐Liss, Inc.
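To make the role of a core atom set concrete, the following is a minimal sketch, not the algorithm of this paper. It assumes NumPy, uses a standard SVD-based least-squares (Kabsch-style) superimposition, and builds a synthetic five-model ensemble in which atoms 0 to 14 are rigid and atoms 15 to 19 are noisy. The function names `superimpose` and `rmsd`, the atom index sets, and all coordinates are hypothetical and chosen only for illustration; in particular, the core set here is fixed by hand rather than derived from an order parameter.

```python
import numpy as np


def superimpose(mobile, target, fit_atoms):
    """Least-squares fit of `mobile` onto `target` using only `fit_atoms`;
    the resulting rigid transform is then applied to every atom of `mobile`."""
    mobile_center = mobile[fit_atoms].mean(axis=0)
    target_center = target[fit_atoms].mean(axis=0)
    p = mobile[fit_atoms] - mobile_center
    q = target[fit_atoms] - target_center
    # SVD of the 3 x 3 covariance matrix yields the optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Flip one axis if needed so the result is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    return (mobile - mobile_center) @ rotation + target_center


def rmsd(a, b):
    """Root-mean-square deviation between two matched N x 3 coordinate sets."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))


def rotation_z(angle):
    """Rotation matrix about the z axis, used to scramble model orientations."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Synthetic ensemble (assumption): a rigid core plus a poorly determined tail.
rng = np.random.default_rng(0)
reference = rng.normal(size=(20, 3)) * 5.0
core = np.arange(15)        # hypothetical well-defined atoms
tail = np.arange(15, 20)    # hypothetical poorly determined atoms
all_atoms = np.arange(20)

models = []
for k in range(5):
    model = reference.copy()
    model[tail] += rng.normal(scale=4.0, size=(len(tail), 3))
    # Give each model an arbitrary orientation and position.
    models.append(model @ rotation_z(0.7 * k) + rng.normal(size=3))

first = models[0]
# RMSD over the core atoms, after fitting on the core only ...
core_fit_rmsd = float(np.mean(
    [rmsd(superimpose(m, first, core)[core], first[core]) for m in models[1:]]))
# ... versus after fitting on all atoms, noisy tail included.
all_fit_rmsd = float(np.mean(
    [rmsd(superimpose(m, first, all_atoms)[core], first[core]) for m in models[1:]]))

print(f"core RMSD, core-atom fit: {core_fit_rmsd:.3g}")
print(f"core RMSD, all-atom fit:  {all_fit_rmsd:.3g}")
```

Because the fit on the core atoms minimizes the RMSD over exactly those atoms, it can never give a larger core RMSD than the all-atom fit. In this toy ensemble the noisy tail pulls the all-atom superimposition away from the rigid core, which is the kind of distortion that a well-chosen core atom set is meant to avoid.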
