DD-HDS: A Method for Visualization and Exploration of High-Dimensional Data

Mapping high-dimensional data in a low-dimensional space, for example for visualization, is a problem of increasingly major concern in data analysis. This paper presents data-driven high-dimensional scaling (DD-HDS), a nonlinear mapping method that follows the line of the multidimensional scaling (MDS) approach, based on the preservation of distances between pairs of data. It improves the performance of existing competitors with respect to the representation of high-dimensional data in two ways. It introduces (1) a specific weighting of distances between data, taking into account the concentration of measure phenomenon, and (2) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, except for a single user-defined parameter setting the tradeoff between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by "force-directed placement" (FDP). Mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can easily be incorporated in most distance-preservation-based nonlinear dimensionality reduction methods.
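
To make the two ingredients above concrete, here is a minimal, hedged sketch in Python/NumPy of a weighted distance-preservation stress in the spirit of DD-HDS, optimized by plain gradient descent as a stand-in for force-directed placement. It is not the authors' implementation: the exact DD-HDS weighting function fitted to the distribution of distances, the precise role of the user parameter, and the FDP update rule are defined in the paper, while the Gaussian weight on the smaller of the two distances, the mapping from `lam` to a scale via the median distance, and the step size below are illustrative assumptions.

```python
import numpy as np


def pairwise_dist(X):
    """All pairwise Euclidean distances between the rows of X (n x n matrix)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1) + 1e-12)


def dd_hds_like_embedding(X, dim=2, lam=0.5, n_iter=500, lr=0.05, seed=0):
    """Embed X in `dim` dimensions by minimizing a weighted stress,
    sum_ij w_ij * (D_ij - d_ij)^2, with plain gradient descent.

    `lam` plays the role of the single user parameter mentioned in the abstract:
    small values emphasize short distances (local neighborhood preservation),
    large values weight all distances almost equally (global mapping).
    """
    rng = np.random.default_rng(seed)
    D = pairwise_dist(X)                      # distances in the original space
    scale = np.median(D[D > 0])
    sigma = lam * scale                       # ASSUMED link between lam and the
                                              # observed distribution of distances
    Y = rng.normal(scale=scale, size=(X.shape[0], dim))  # random initial layout
    for _ in range(n_iter):
        d = pairwise_dist(Y)                  # distances in the output space
        # Sketch of the "symmetric" handling: the weight depends on the smaller
        # of the two distances, so a pair matters whenever it is close in either
        # space, penalizing false neighbors as well as torn neighborhoods.
        w = np.exp(-np.minimum(D, d) ** 2 / (2.0 * sigma ** 2))
        np.fill_diagonal(w, 0.0)
        # Gradient of the stress w.r.t. Y, holding w fixed within each step.
        coef = -2.0 * w * (D - d) / np.maximum(d, 1e-12)
        grad = (coef[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad / X.shape[0]           # small averaged displacement, in
                                              # the spirit of a force-directed step
    return Y


if __name__ == "__main__":
    # Usage: unroll a noisy 3-D "Swiss roll"-like cloud into 2-D.
    rng = np.random.default_rng(1)
    t = rng.uniform(0.5, 3 * np.pi, 300)
    X = np.c_[t * np.cos(t), t * np.sin(t), rng.uniform(0, 5, 300)]
    Y = dd_hds_like_embedding(X, dim=2, lam=0.3)
    print(Y.shape)  # (300, 2)
```

Holding the weights fixed within each gradient step and averaging the displacement over points are simplifications chosen for brevity; the paper's own weighting, set from the effective distribution of distances, and its FDP optimization should be preferred for faithful results.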
