Landmark diffusion maps (L-dMaps): Accelerated manifold learning out-of-sample extension

Diffusion maps are a nonlinear manifold learning technique based on harmonic analysis of a diffusion process over the data. Out-of-sample extensions have computational complexity $\mathcal{O}(N)$, where $N$ is the number of points comprising the manifold, which frustrates online learning applications requiring rapid embedding of high-dimensional data streams. We propose landmark diffusion maps (L-dMaps) to reduce the complexity to $\mathcal{O}(M)$, where $M \ll N$ is the number of landmark points selected using pruned spanning trees or k-medoids. Offering speedups of order $N/M$ in out-of-sample extension, L-dMaps enables the application of diffusion maps to high-volume and/or high-velocity streaming data. We illustrate our approach on three datasets: the Swiss roll, molecular simulations of a C$_{24}$H$_{50}$ polymer chain, and biomolecular simulations of alanine dipeptide. We demonstrate up to 50-fold speedups in out-of-sample extension for the molecular systems, with errors of less than 4% in manifold reconstruction fidelity relative to calculations over the full dataset.
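The source of the $\mathcal{O}(M)$ cost can be illustrated with a minimal sketch, assuming a Gaussian kernel, a standard Nyström-style extension, and a set of landmarks that has already been chosen. The function names `fit_landmark_dmap` and `embed_new_point` and the bandwidth parameter `eps` are hypothetical, introduced only for illustration; the sketch does not attempt to reproduce the landmark selection step (pruned spanning trees or k-medoids) or any other detail of the method beyond what the abstract states.

```python
import numpy as np


def fit_landmark_dmap(landmarks, eps, n_coords=2):
    """Diffusion map over M landmark points only (illustrative sketch)."""
    # Pairwise squared distances between landmarks, shape (M, M)
    d2 = ((landmarks[:, None, :] - landmarks[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * eps))          # assumed Gaussian kernel
    deg = K.sum(axis=1)
    # Symmetric matrix similar to the Markov matrix P = D^-1 K
    S = K / np.sqrt(np.outer(deg, deg))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]          # sort eigenvalues descending
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / np.sqrt(deg)[:, None]      # right eigenvectors of P
    # Drop the trivial constant eigenvector (eigenvalue 1)
    return vals[1:n_coords + 1], psi[:, 1:n_coords + 1]


def embed_new_point(x, landmarks, eps, vals, psi):
    """Nystrom-style extension: M kernel evaluations per new point."""
    k = np.exp(-((landmarks - x) ** 2).sum(-1) / (2.0 * eps))
    p = k / k.sum()                         # transition probabilities to landmarks
    return (p @ psi) / vals
```

Each call to `embed_new_point` touches only the $M$ landmarks, so its cost is independent of the total number of points $N$. A simple consistency check is that extending a landmark itself returns that landmark's own embedding coordinates.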
