LDSScanner: Exploratory Analysis of Low-Dimensional Structures in High-Dimensional Datasets

Many approaches for analyzing a high-dimensional dataset assume that the dataset contains specific structures, e.g., clusters in linear subspaces or non-linear manifolds. This leads to a trial-and-error process of finding the appropriate model and parameters. This paper contributes an exploratory interface that supports visual identification of low-dimensional structures in a high-dimensional dataset and facilitates the selection of suitable data models and configurations. Our key idea is to abstract a set of global and local feature descriptors from the neighborhood graph-based representation of the latent low-dimensional structure, such as the pairwise geodesic distance (GD) among points and the pairwise local tangent space divergence (LTSD) among pointwise local tangent spaces (LTS). We propose a new LTSD-GD view, which is constructed by mapping LTSD and GD to the $x$ axis and the $y$ axis, respectively, using 1D multidimensional scaling.
Unlike traditional dimensionality reduction methods, which preserve various kinds of distances among points, the LTSD-GD view presents the distribution of pointwise LTS ($x$ axis) and the variation of LTS within structures (the combination of the $x$ axis and the $y$ axis). We design and implement a suite of visual tools for navigating and reasoning about the intrinsic structures of a high-dimensional dataset. Three case studies verify the effectiveness of our approach.
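
To make the construction of the LTSD-GD view more concrete, the following is a minimal sketch of the pipeline outlined above, not the authors' implementation. It assumes a brute-force k-nearest-neighbor graph, shortest-path distances as GD, local PCA as the tangent space estimate, and classical MDS for the 1D embeddings. The abstract does not define LTSD, so the sketch substitutes the projection-metric distance between tangent subspaces as a stand-in. All function names, the toy data, and the parameters `k` and `d` are hypothetical.

```python
import numpy as np

def knn_indices(X, k):
    # Brute-force Euclidean distances and k nearest neighbors (self excluded).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    return D, idx

def geodesic_distances(D, idx):
    # Shortest-path distances over the symmetrized kNN graph (Floyd-Warshall).
    n = D.shape[0]
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    rows = np.repeat(np.arange(n), idx.shape[1])
    cols = idx.ravel()
    G[rows, cols] = D[rows, cols]
    G[cols, rows] = D[rows, cols]
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

def local_tangent_bases(X, idx, d):
    # Local PCA: top-d right singular vectors of each centered neighborhood.
    bases = []
    for i in range(X.shape[0]):
        nbrs = X[np.append(idx[i], i)]
        _, _, Vt = np.linalg.svd(nbrs - nbrs.mean(axis=0), full_matrices=False)
        bases.append(Vt[:d].T)  # orthonormal columns spanning the tangent estimate
    return bases

def tangent_divergence(bases):
    # ASSUMPTION: projection-metric distance ||P_i - P_j||_F / sqrt(2),
    # used here as a stand-in because the abstract does not define LTSD.
    P = np.stack([B @ B.T for B in bases])
    diff = P[:, None] - P[None, :]
    return np.linalg.norm(diff, axis=(2, 3)) / np.sqrt(2)

def mds_1d(dist):
    # Classical MDS to one dimension: double-center the squared distances
    # and keep the leading eigenvector scaled by sqrt of its eigenvalue.
    n = dist.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (dist ** 2) @ J
    w, V = np.linalg.eigh(B)  # eigenvalues in ascending order
    return V[:, -1] * np.sqrt(max(w[-1], 0.0))

# Toy data (hypothetical): a quarter circle embedded in 3D, intrinsic dimension 1.
t = np.linspace(0.0, np.pi / 2, 60)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])

D, idx = knn_indices(X, k=4)
gd = geodesic_distances(D, idx)
ltsd = tangent_divergence(local_tangent_bases(X, idx, d=1))

# Each row is one point: x from 1D MDS on LTSD, y from 1D MDS on GD.
view = np.column_stack([mds_1d(ltsd), mds_1d(gd)])
```

On this toy curve the tangent direction rotates steadily along the arc, so both coordinates of `view` vary with position on the curve. A dataset made of flat linear subspaces would instead be expected to concentrate at a few $x$ positions, since points in the same subspace share a tangent space. The Floyd-Warshall step costs cubic time in the number of points, so the sketch is only suitable for small samples.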
