Self-Organizing Nebulous Growths for Robust and Incremental Data Visualization

Nonparametric dimensionality reduction techniques, such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), are proficient at visualizing data sets of fixed size. However, they cannot incrementally map and insert new data points into an existing visualization. We present self-organizing nebulous growths (SONG), a parametric nonlinear dimensionality reduction technique that supports incremental data visualization, i.e., the incremental addition of new data while preserving the structure of the existing visualization. In addition, SONG can handle new data increments regardless of whether they are similar to or heterogeneous with respect to the already observed data distribution. We test SONG on a variety of real and simulated data sets. The results show that SONG is superior to Parametric t-SNE, t-SNE, and UMAP in incremental data visualization. In particular, for heterogeneous increments, SONG improves over Parametric t-SNE by 14.98% on the Fashion MNIST data set and 49.73% on the MNIST data set in cluster quality, as measured by adjusted mutual information scores. On similar, or homogeneous, increments, the improvements are 8.36% and 42.26%, respectively. Furthermore, even when these data sets are presented all at once, SONG performs better than or comparably to UMAP and is superior to t-SNE. We also demonstrate that the algorithmic foundations of SONG render it more tolerant to noise than UMAP and t-SNE, thus providing greater utility for data with high variance, high mixing of clusters, or noise.
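Since cluster quality above is reported as adjusted mutual information (AMI), a minimal pure-Python sketch of that score may help make the metric concrete. It assumes the common arithmetic-mean normalization; the abstract does not state which variant was used, nor how cluster labels were derived from the embeddings, so the function names and the toy label lists here are purely illustrative.

```python
from collections import Counter
from math import exp, lgamma, log


def entropy(labels):
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two labelings of the same points."""
    n = len(u)
    a, b = Counter(u), Counter(v)
    joint = Counter(zip(u, v))
    return sum((nij / n) * log(n * nij / (a[i] * b[j]))
               for (i, j), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI of two random labelings with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for ai in Counter(u).values():
        for bj in Counter(v).values():
            for nij in range(max(1, ai + bj - n), min(ai, bj) + 1):
                # Log of the hypergeometric probability of an overlap of nij.
                log_p = (lgamma(ai + 1) + lgamma(bj + 1)
                         + lgamma(n - ai + 1) + lgamma(n - bj + 1)
                         - lgamma(n + 1) - lgamma(nij + 1)
                         - lgamma(ai - nij + 1) - lgamma(bj - nij + 1)
                         - lgamma(n - ai - bj + nij + 1))
                emi += (nij / n) * log(n * nij / (ai * bj)) * exp(log_p)
    return emi


def adjusted_mutual_info(u, v):
    """AMI with arithmetic-mean normalization (an assumed variant)."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    denom = (entropy(u) + entropy(v)) / 2 - emi
    if abs(denom) < 1e-12:
        # Degenerate case (e.g. both labelings have one cluster): treat as a match.
        return 1.0
    return (mi - emi) / denom


# Hypothetical labels: ground-truth classes vs. clusters found in an embedding.
true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found_labels = [1, 1, 1, 0, 0, 2, 2, 2, 2]
score = adjusted_mutual_info(true_labels, found_labels)
```

A score of 1 indicates identical partitions up to relabeling, while values near or below 0 indicate agreement no better than chance, which is why AMI suits comparisons across embeddings whose cluster counts differ.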