CATBOSS: Cluster Analysis of Trajectories Based on Segment Splitting

Molecular dynamics (MD) simulations are an increasingly powerful tool for predicting and analyzing molecular behavior. However, the enormous volume of data these simulations generate can be difficult to process and present in a human-readable form. Cluster analysis is commonly used to partition the data into structurally distinct states. To date, cluster analysis of MD simulations has generally treated simulation snapshots as a collection of independent data points and attempted to separate them into clusters based on structural similarity alone. We present a method that improves on the state of the art by exploiting the temporal information in MD trajectories to enable more accurate clustering at a lower memory cost. This new method, cluster analysis of trajectories based on segment splitting (CATBOSS), applies density-peak-based clustering to classify trajectory segments learned by change detection. Applying the method to a synthetic toy model as well as four real-life data sets (trajectories of MD simulations of alanine dipeptide and valine dipeptide, and of two fast-folding proteins), we find CATBOSS to be robust and highly performant, yielding natural-looking cluster boundaries and greatly improving clustering resolution. Because classifying points into segments groups them close to the state means and thereby emphasizes density gaps in the data, CATBOSS applied to the valine dipeptide system is even able to account for a degree of freedom deliberately omitted from the input data set. We also demonstrate the potential utility of CATBOSS in distinguishing metastable states from transition segments, as well as its promising application to cases where there is little or no advance knowledge of intrinsic coordinates, making it a highly versatile analysis tool.
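The two-stage idea described above (split the trajectory into segments by change detection, then cluster the segments by density peaks) can be illustrated with a minimal sketch. This is not the CATBOSS implementation: the change detector here is a plain binary segmentation on mean shifts, the density estimate is a length-weighted Gaussian kernel on segment means, and the names `split_segments`, `density_peaks`, `penalty`, `min_len`, and `d_c` are hypothetical choices made for illustration only.

```python
import numpy as np

def split_segments(x, penalty, min_len=5):
    """Binary segmentation on mean shifts; returns boundaries [0, ..., n]."""
    def sse(a):
        # Sum of squared deviations from the segment mean.
        return ((a - a.mean(axis=0)) ** 2).sum()

    def recurse(lo, hi):
        total, best_gain, best_k = sse(x[lo:hi]), 0.0, None
        for k in range(lo + min_len, hi - min_len + 1):
            gain = total - sse(x[lo:k]) - sse(x[k:hi])
            if gain > best_gain:
                best_gain, best_k = gain, k
        if best_k is None or best_gain < penalty:
            return []  # no split is worth its cost; keep one segment
        return recurse(lo, best_k) + [best_k] + recurse(best_k, hi)

    return [0] + recurse(0, len(x)) + [len(x)]

def density_peaks(points, weights, d_c, n_clusters):
    """Density-peak clustering of weighted points (here: segment means)."""
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Assumed density: Gaussian kernel, each segment weighted by its length.
    rho = (weights[None, :] * np.exp(-(dist / d_c) ** 2)).sum(axis=1)
    order = np.argsort(-rho)  # indices in decreasing density
    delta = np.empty(len(points))
    parent = np.full(len(points), -1)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, len(order)):
        i = order[rank]
        higher = order[:rank]  # points ranked denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j
    gamma = rho * delta
    gamma[order[0]] = np.inf  # the densest point is always a center
    centers = np.argsort(-gamma)[:n_clusters]
    labels = np.full(len(points), -1)
    labels[centers] = np.arange(n_clusters)
    for i in order:  # parents are denser, so they are labeled first
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels

# Synthetic two-state toy trajectory (made up for this sketch): A, B, A, B.
rng = np.random.default_rng(0)
state_means = np.array([[0.0, 0.0], [5.0, 5.0]])
state_path = np.repeat([0, 1, 0, 1], 50)
traj = state_means[state_path] + 0.2 * rng.normal(size=(200, 2))

bounds = split_segments(traj, penalty=10.0)
lengths = np.diff(bounds)
seg_means = np.array([traj[a:b].mean(axis=0)
                      for a, b in zip(bounds[:-1], bounds[1:])])
seg_labels = density_peaks(seg_means, lengths.astype(float),
                           d_c=1.0, n_clusters=2)
frame_labels = np.repeat(seg_labels, lengths)  # one label per frame
```

Clustering a handful of segment summaries instead of every frame is what keeps the pairwise distance matrix small in this sketch; the distance measure, density estimator, and change-detection procedure actually used by CATBOSS may differ from the stand-ins chosen here.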
