Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity

The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which performs unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we solve a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered, which yields a set of orthogonal coordinates along which dissimilarity in the dataset is maximized. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields, such as genomics and neuroscience, where existing analysis tools require ad hoc assumptions on the data structure or lack the interpretability of the present method, or where the underlying processes are less accessible.
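
The splitting procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: Euclidean distance as the pairwise dissimilarity, a graph-Laplacian form of the generalized eigenvalue problem (L x = lambda D x, with L = D - A), a sign threshold on the leading eigenvector to define each binary split, and a hypothetical `max_depth` stopping rule. The function names `binary_split` and `build_tree` are likewise hypothetical.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform


def binary_split(points):
    """Return a boolean mask assigning each point to one of two groups."""
    # Pairwise dissimilarity matrix; Euclidean distance is an assumption.
    A = squareform(pdist(points))
    D = np.diag(A.sum(axis=1))  # degree matrix of the dissimilarity graph
    L = D - A                   # graph Laplacian (assumed form of the problem)
    # Generalized symmetric eigenproblem L x = lambda D x.
    # eigh returns eigenvalues in ascending order, so the last column
    # is the coordinate along which dissimilarity is largest.
    _, vecs = eigh(L, D)
    coloring = vecs[:, -1]
    # Sign threshold: an illustrative choice, not the paper's stated rule.
    return coloring > 0


def build_tree(points, indices=None, max_depth=2):
    """Recursively bisect the data into a binary tree of index sets."""
    if indices is None:
        indices = np.arange(len(points))
    subset = points[indices]
    # Stop at the (hypothetical) depth limit or when no split is possible.
    if max_depth == 0 or len(indices) < 2 or np.allclose(subset, subset[0]):
        return {"indices": indices.tolist()}
    mask = binary_split(subset)
    return {
        "indices": indices.tolist(),
        "children": [
            build_tree(points, indices[mask], max_depth - 1),
            build_tree(points, indices[~mask], max_depth - 1),
        ],
    }
```

Calling `build_tree(points, max_depth=1)` on an array of shape `(n_points, n_features)` returns a nested dictionary whose two children hold the indices of each side of the first split. In this sketch the tree depth is fixed by hand, whereas the method described above lets the number of clusters emerge from the tree itself.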
