Fast Similarity Computation for t-SNE

Data visualization has become a fundamental part of data engineering, and t-SNE is one of the most popular data visualization approaches. However, its computation cost is quadratic in the number of data points because it needs to compute similarities for all pairs of data points. One practical way of using t-SNE is random walk-based t-SNE. This approach visualizes user-specified landmark points using similarities between them that are derived from random walks on a neighborhood graph of the data points. It offers two ways of computing these similarities: the direct approach and the analytical approach. The direct approach approximates the similarities by explicitly simulating random walks on the graph. Unfortunately, it needs to perform numerous random walks to reach adequate accuracy. The analytical approach performs Cholesky factorization on the graph Laplacian and computes exact similarities from the factorized graph Laplacian. The Cholesky factorization itself, however, incurs a high computation cost. Our proposal, F-tSNE, reduces the computation cost of random walk-based t-SNE by computing the LDL decomposition of the graph Laplacian based on two ideas: (1) reducing the number of non-zero elements in the LDL decomposition by using a reordering matrix and (2) exploiting the sparse structure of the graph when computing the similarities. Theoretically, our approach is guaranteed to yield exact similarities. Experiments show that it is up to 88.4 times faster than the existing alternatives.
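
To make idea (1) concrete, the following is a minimal sketch of how a reordering can reduce the number of non-zero elements in an LDL decomposition without changing the solution. It is not the F-tSNE algorithm: the star graph, the `eps` regularization that makes the singular graph Laplacian positive definite, the dense unpivoted `ldl_decompose` routine, and the "hub last" permutation are all assumptions chosen for illustration.

```python
import numpy as np

def ldl_decompose(A):
    """Dense, unpivoted LDL^T of a symmetric positive definite matrix.
    Returns unit lower-triangular L and the diagonal d with A = L diag(d) L^T."""
    n = A.shape[0]
    L = np.eye(n)
    d = np.zeros(n)
    for j in range(n):
        d[j] = A[j, j] - np.sum(L[j, :j] ** 2 * d[:j])
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - np.sum(L[i, :j] * L[j, :j] * d[:j])) / d[j]
    return L, d

def star_laplacian(n, eps=1e-2):
    """Laplacian of a star graph whose hub is node 0.
    eps * I is added (an assumption) so the matrix is positive definite."""
    W = np.zeros((n, n))
    W[0, 1:] = 1.0
    W[1:, 0] = 1.0
    return np.diag(W.sum(axis=1)) - W + eps * np.eye(n)

def count_nonzeros(L, tol=1e-12):
    """Number of non-zero entries strictly below the diagonal of L."""
    return int(np.sum(np.abs(np.tril(L, -1)) > tol))

def ldl_solve(L, d, b):
    """Solve (L diag(d) L^T) x = b using the factors."""
    y = np.linalg.solve(L, b)
    return np.linalg.solve(L.T, y / d)

n = 6
A = star_laplacian(n)

# Reordering matrix P that moves the hub to the last position.
perm = np.arange(n)[::-1]
P = np.eye(n)[perm]
A_perm = P @ A @ P.T

L1, d1 = ldl_decompose(A)       # hub eliminated first
L2, d2 = ldl_decompose(A_perm)  # hub eliminated last

fill_hub_first = count_nonzeros(L1)
fill_hub_last = count_nonzeros(L2)

# The reordering changes the sparsity of L but not the solution.
b = np.arange(1.0, n + 1.0)
x_direct = np.linalg.solve(A, b)
x_reordered = P.T @ ldl_solve(L2, d2, P @ b)

print(fill_hub_first, fill_hub_last)
```

Eliminating the hub first couples all remaining nodes and fills the whole lower triangle of L, while eliminating it last leaves only the hub's own row non-zero. A sparse implementation would pick the permutation with a fill-reducing heuristic rather than by hand.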
