Amplifying state dissimilarity leads to robust and interpretable clustering of scientific data

Existing methods that aim to automatically cluster data into physically meaningful subsets typically require assumptions regarding the number, size, or shape of the coherent subgroups. We present a new method, simultaneous Coherent Structure Coloring (sCSC), which performs unsupervised clustering without a priori guidance regarding the underlying structure of the data. To illustrate the versatility of the method, we apply it to frontier physics problems at vastly different temporal and spatial scales: a theoretical model of geophysical fluid dynamics, laboratory measurements of vortex ring formation and entrainment, and an atomistic simulation of the Protein G system. The theoretical flow involves sparse sampling of non-equilibrium dynamics, where the new technique can find and characterize the structures that govern fluid transport using two orders of magnitude less data than required by existing methods. Application of the method to empirical measurements of vortex formation leads to the discovery of a well-defined region in which vortex ring entrainment occurs, with potential implications ranging from flow control to cardiovascular diagnostics. Finally, the protein folding example demonstrates a data-rich application governed by equilibrium dynamics, where the technique automatically discovers the hierarchy of distinct processes that govern protein folding and clusters protein configurations accordingly. We anticipate straightforward translation to many other fields where existing analysis tools, such as k-means and traditional hierarchical clustering, require ad hoc assumptions about the data structure or lack the interpretability of the present method. The method is also potentially generalizable to fields where the underlying processes are less accessible, such as genomics and neuroscience.
