UMAP does not reproduce high-dimensional similarities due to negative sampling

UMAP has supplanted t-SNE as the state of the art for visualizing high-dimensional datasets in many disciplines, yet the reason for its success is not well understood. In this work, we investigate UMAP's sampling-based optimization scheme in detail. We derive UMAP's effective loss function in closed form and find that it differs from the published one. As a consequence, we show that UMAP does not aim to reproduce its theoretically motivated high-dimensional similarities. Instead, it tries to reproduce similarities that only encode the shared k-nearest-neighbor graph, which challenges the previous understanding of UMAP's effectiveness. We claim that the key to UMAP's success is rather its implicit balancing of attraction and repulsion, which results from negative sampling. This balancing in turn facilitates optimization via gradient descent. We corroborate our theoretical findings on toy data and single-cell RNA sequencing data.
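
To make concrete what similarities that only encode the k-nearest-neighbor graph could look like, the following is a minimal NumPy sketch, not the authors' implementation. It assumes Euclidean distances and assumes that an edge is kept whenever either point lists the other among its k nearest neighbors; the function name `symmetric_knn_graph` is hypothetical.

```python
import numpy as np


def symmetric_knn_graph(X, k):
    """Binary adjacency matrix of a symmetrized kNN graph (illustrative sketch).

    Assumption: A[i, j] is True if j is among the k nearest neighbors of i,
    or i is among the k nearest neighbors of j (Euclidean distance).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor

    # Indices of the k nearest neighbors of every point
    nn = np.argsort(d2, axis=1)[:, :k]

    # Directed kNN graph, then symmetrize with a logical OR
    A = np.zeros((n, n), dtype=bool)
    A[np.repeat(np.arange(n), k), nn.ravel()] = True
    return A | A.T


if __name__ == "__main__":
    # Four points on a line; with k=1 the graph is the chain 0-1-2-3
    X = np.array([[0.0], [1.0], [3.0], [10.0]])
    print(symmetric_knn_graph(X, k=1).astype(int))
```

Such a matrix carries no information about how close two neighbors are, only whether they are neighbors, which is the sense in which graded high-dimensional similarities would be lost.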
