Clustering with UMAP: Why and How Connectivity Matters

Topology-based dimensionality reduction methods such as t-SNE and UMAP have seen increasing success and popularity on high-dimensional data. These methods have strong mathematical foundations and are built on the intuition that the topology of the low-dimensional embedding should be close to that of the original high-dimensional data. Since the initial topological structure is a precursor to the success of the algorithm, this naturally raises the question: what makes a "good" topological structure for dimensionality reduction? In this paper, which focuses on UMAP, we study the effects of node connectivity (k-nearest neighbors vs. mutual k-nearest neighbors) and relative neighborhood (adjacent via path neighbors) on dimensionality reduction. We explore these concepts through extensive ablation studies on four standard image and text datasets (MNIST, FMNIST, 20NG, AG), reducing to 2 and 64 dimensions. Our findings indicate that a more refined notion of connectivity (mutual k-nearest neighbors with minimum spanning tree), together with a flexible method of constructing the local neighborhood (path neighbors), can achieve a much better representation than default UMAP, as measured by downstream clustering performance.
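The connectivity refinement described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `mutual_knn_with_mst` and the choice of scikit-learn/SciPy primitives are assumptions. The idea is that a mutual k-NN graph keeps only edges agreed on by both endpoints (which prunes noisy hub edges but tends to disconnect the graph), and a minimum spanning tree is then added back to restore connectivity.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree


def mutual_knn_with_mst(X, k=10):
    """Build a mutual k-NN graph, then add MST edges to keep it connected."""
    # Directed k-NN graph: nonzero entry (i, j) holds the distance
    # when j is among the k nearest neighbors of i.
    knn = kneighbors_graph(X, n_neighbors=k, mode="distance")

    # Mutual k-NN: an edge survives only if it exists in both directions.
    # Elementwise minimum zeroes out any one-sided edge.
    mutual = knn.minimum(knn.T)

    # Mutual k-NN graphs are often disconnected, so add the edges of a
    # minimum spanning tree over the full pairwise-distance graph.
    dists = squareform(pdist(X))
    mst = minimum_spanning_tree(dists)
    mst = mst + mst.T  # symmetrize the one-directional MST output

    # Union of the two edge sets (weights agree where edges overlap).
    return mutual.maximum(mst)
```

The resulting sparse symmetric graph could then be handed to a UMAP-style layout stage in place of the default k-NN graph. Note that the dense pairwise-distance matrix used here for the MST is only practical for small datasets; a real pipeline would compute the MST over an approximate neighbor graph instead.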
