A scalable algorithm to order and annotate continuous observations reveals the metastable states visited by dynamical systems

Abstract. Advances in IT infrastructure have enabled the generation and storage of very large data sets that describe complex systems continuously in time. Such data can derive from both simulations and measurements, and their analysis requires scalable algorithms. In this contribution, we propose a scalable algorithm that partitions instantaneous observations (snapshots) of a complex system into kinetically distinct sets (termed basins). To do so, we first order the snapshots using the method's only essential parameter, a definition of pairwise distance, and then annotate the resulting sequence, the so-called progress index, in different ways. Specifically, we propose a combination of cut-based and structural annotations, with the former responsible for the kinetic grouping and the latter for diagnostics and interpretation. The method is applied to an illustrative test case, and the scaling of an approximate version is demonstrated to be O(N log N), with N being the number of snapshots. Two real-world data sets, from river hydrology measurements and from protein folding simulations, are then used to highlight the utility of the method in finding basins for complex systems. Both limitations and benefits of the approach are discussed, along with routes for future research.
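
The two steps named in the abstract, ordering and cut-based annotation, can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: the ordering is taken to grow by repeatedly appending the unvisited snapshot closest to any visited one, the cut-based annotation is taken to count time-consecutive snapshot pairs that straddle each position of the ordering, the distance is Euclidean, and the names `progress_index` and `cut_annotation` are hypothetical. The sketch is exact and costs O(N^2); it does not reproduce the approximate O(N log N) version.

```python
import numpy as np

def progress_index(X, start=0):
    """Order snapshots by repeatedly appending the unvisited snapshot
    that is closest to any already visited one (assumed ordering rule)."""
    n = len(X)
    visited = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)  # best[j]: distance from j to the visited set
    order = []
    current = start
    for _ in range(n):
        visited[current] = True
        order.append(current)
        # Euclidean distance is an assumption; any pairwise distance works.
        d = np.linalg.norm(X - X[current], axis=1)
        best = np.minimum(best, d)
        best[visited] = np.inf  # never pick a visited snapshot again
        current = int(np.argmin(best))
    return np.array(order)

def cut_annotation(order):
    """For each position i, count time steps t -> t+1 whose two snapshots
    lie on opposite sides of the boundary after position i."""
    n = len(order)
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n)  # position of each snapshot in the ordering
    lo = np.minimum(rank[:-1], rank[1:])
    hi = np.maximum(rank[:-1], rank[1:])
    diff = np.zeros(n + 1, dtype=int)
    np.add.at(diff, lo, 1)   # a pair starts straddling at position lo
    np.add.at(diff, hi, -1)  # and stops straddling at position hi
    return np.cumsum(diff)[:n]

# Toy time series: 50 snapshots near 0, then 50 near 10, then 50 near 0.
rng = np.random.default_rng(0)
centers = np.repeat([0.0, 10.0, 0.0], 50)
X = (centers + rng.normal(scale=0.1, size=150)).reshape(-1, 1)
order = progress_index(X)
cuts = cut_annotation(order)
```

In this toy series the ordering exhausts the state near 0 before it enters the state near 10, so the cut count drops to the small number of state-to-state transitions at the boundary between the two groups. Under the assumptions above, such a low-count position is what would mark the border between two basins.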
