Going beyond Clustering in MD Trajectory Analysis: An Application to Villin Headpiece Folding

Recent advances in computing technology have enabled microsecond-long all-atom molecular dynamics (MD) simulations of biological systems. Methods that can distill the salient features of such large trajectories are now urgently needed. Conventional clustering methods used to analyze MD trajectories suffer from several drawbacks: (i) they are not data driven, (ii) they are unstable to noise and to changes in cut-off parameters such as cluster radius and cluster number, and (iii) they do not reduce the dimensionality of the trajectories and hence are unsuitable for finding collective coordinates. We advocate the application of principal component analysis (PCA) and a non-metric multidimensional scaling (nMDS) method to reduce MD trajectories and overcome the drawbacks of clustering. To illustrate the superiority of nMDS over other methods in reducing data and reproducing salient features, we analyze three complete villin headpiece folding trajectories. Our analysis suggests that the folding process of the villin headpiece is structurally heterogeneous.
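
As a rough illustration of what reducing a trajectory to a few collective coordinates means, the following minimal sketch applies PCA to a toy trajectory using only NumPy. It is not the analysis pipeline of the paper: the function name `pca_project`, the synthetic data, and the choice of two components are all assumptions made for illustration, and the frames are assumed to be already superposed on a reference structure so that rigid-body motion has been removed. An nMDS analysis would start instead from the rank order of pairwise dissimilarities between frames (for example RMSD values) and is not shown here.

```python
import numpy as np


def pca_project(coords, n_components=2):
    """Project trajectory frames onto their leading principal components.

    coords: array of shape (n_frames, n_features), e.g. flattened
            Cartesian coordinates of pre-aligned frames (assumption).
    Returns the projection (n_frames, n_components) and the fraction of
    total variance captured by each retained component.
    """
    # Remove the mean structure so that PCA describes fluctuations only.
    centered = coords - coords.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal axes.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / (coords.shape[0] - 1)
    explained = variance / variance.sum()
    projection = centered @ vt[:n_components].T
    return projection, explained[:n_components]


# Hypothetical toy trajectory: one slow collective motion plus small noise.
rng = np.random.default_rng(0)
n_frames, n_atoms = 200, 10
progress = np.linspace(0.0, 1.0, n_frames)      # fictitious folding progress
mode = rng.normal(size=3 * n_atoms)             # fictitious collective mode
traj = 5.0 * np.outer(progress, mode)
traj += 0.1 * rng.normal(size=(n_frames, 3 * n_atoms))

projection, explained = pca_project(traj, n_components=2)
print(projection.shape)
print(explained)
```

In this toy case nearly all of the variance falls on the first component, because the data were built around a single collective motion; real folding trajectories would generally not reduce this cleanly.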
