Fast Training of Nonlinear Embedding Algorithms

Stochastic neighbor embedding (SNE) and related nonlinear manifold learning algorithms achieve high-quality low-dimensional representations of similarity data, but are notoriously slow to train. We propose a generic formulation of embedding algorithms that includes SNE and other existing algorithms, and study their relation with spectral methods and graph Laplacians. This allows us to define several partial-Hessian optimization strategies, characterize their global and local convergence, and evaluate them empirically. We achieve up to two orders of magnitude speedup over existing training methods with a strategy (which we call the spectral direction) that adds nearly no overhead to the gradient and yet is simple, scalable and applicable to several existing and future embedding algorithms. We consider a well-known formulation of dimensionality reduction: we are given a matrix of N ×N (dis)similarity values, corresponding to pairs of high-dimensional points y1, . . . ,yN (objects), which need not be explicitly given, and we want to obtain corresponding low-dimensional pointsx1, . . . ,xN ∈ R (images) whose Euclidean distances optimally preserve the similarities. Methods of thi s type have been widely used, often for 2D visualization, in all sort of applications (notably, in psychology). They include multidimensional scaling (originating in psychomet rics and statistics; Borg & Groenen, 2005) and its variants such as Sammon’s mapping ( Sammon, 1969), PCA defined on the Gram matrix, and several methods recently developed in machine learning: spectral methods such as Laplacian eigenmaps ( Belkin & Niyogi, 2003) or locally linear embedding ( Roweis & Saul , 2000), convex formulations such as maximum variance unfolding ( Weinberger & Saul , 2006), and nonconvex formulations such as stochastic neighbor embedding (SNE; Hinton & Roweis, 2003) and Appearing inProceedings of the29 th International Conference on Machine Learning , Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s). its variations (symmetric SNE, s-SNE: Cook et al. , 2007; Venna & Kaski , 2007; t-SNE: van der Maaten & Hinton , 2008); kernel information embedding ( Memisevic, 2006); and the elastic embedding (EE; Carreira-Perpĩ nán, 2010). Spectral methods have become very popular because they have a unique solution that can be efficiently computed by a sparse eigensolver, and yet they are able to unfold nonlinear, convoluted manifolds. That said, their embeddings are far from perfect, particularly when the data has nonuniform density or multiple manifolds. Better results have been obtained by the nonconvex methods, whose objective functions better characterize the desired embeddings. Carreira-Perpĩ nán (2010) showed that several of these methods (e.g. SNE, EE) add a point-separating term to the Laplacian eigenmaps objective. This causes improved embeddings: images of nearby objects are encouraged to project nearby but, also, images of distant objects are encouraged to project far away. However, a fundamental problem with nonconvex methods, echoed in most of the papers mentioned, has been their difficult optimization. First, they can converge to bad local optima. In practice, this can be countered by using a good initialization (e.g. from spectral methods), by simulated annealing (e.g. adding noise to the updates;Hinton & Roweis, 2003) or by homotopy methods (Memisevic, 2006; Carreira-Perpĩ nán, 2010). Second, numerical optimization has been found to be very slow. Most previous work has used simple algorithms, some adapted from the neural net literature, such as gradient descent wit h momentum and adaptive learning rate, or conjugate gradients. These optimizers are very slow with ill-conditioned problems and have limited the applicability of nonlinear embedding methods to small datasets; hours of training for a few thousand points are typical, which rules out interactive visualization and allows only a coarse model selection . Our goal in this paper is to devise training algorithms that are not only significantly faster but also scale up to larger datasets and generalize over a family of embedding algorithms (SNE,t-SNE, EE and others). We do this not by simply using an off-the-shelf optimizer, but by understand ing the common structure of the Hessian in these algoPartial-Hessian Strategies for Fast Learning of Nonlinear Embeddings rithms and their relation with the graph Laplacian of spectral methods. Thus, our first task is to provide a general formulation of nonconvex embeddings (section 1) and understand their Hessian structure, resulting in several opt imization strategies (section 2). We then empirically evaluate them (section 3) and conclude by recommending a strategy that is simple, generic, scalable and typically (but not always) fastest—by up to two orders of magnitude over existing methods. Throughout we write pd (psd) to mean positive (semi)definite, and likewise nd (nsd). 1. A General Embeddings Formulation Call X = (x1, . . . ,xN ) thed × N matrix of low-dimensional points, and define an objective function: E(X;λ) = E(X) + λE(X) λ ≥ 0 (1) whereE is theattractive term, which is often quadratic psd and minimal with coincident points, and E is therepulsive term, which is often nonlinear and minimal when points separate infinitely. Optimal embeddings balance both forces. Both terms depend on X through Euclidean distances between points and thus are shift and rotation invariant. We obtain several important special cases: Normalized symmetric methods minimize the KL divergence between a posterior probability distribution Q over each point pair normalized by the sum over all point pairs (whereK is a kernel function): qnm = K(‖xn − xm‖ 2 ) ∑N n,m=1 K(‖xn′ − xm′‖ 2 ) , qnn = 0 and a distributionP analogously defined on the data Y (thus constant wrtX) with possibly a different kernel and width. This is equivalent to choosing E(X) = − ∑N n,m=1 pnm logK(‖xn − xm‖ 2 ), E(X) = log ∑N n,m=1 K(‖xn − xm‖ 2 ) andλ = 1 in eq. (1). Particular cases are s-SNE ( Cook et al. , 2007) andt-SNE, with Gaussian and Student’s t kernels, resp. We will call pnm = w nm from now on. Normalized nonsymmetric methodsconsider instead per-point distributionsPn andQn, as in the original SNE (Hinton & Roweis, 2003). Their expressions are more complicated and we focus here on the symmetric ones. Unnormalized models dispense with distributions and are simpler. For a Gaussian kernel, in the elastic embedding (EE;Carreira-Perpĩ nán, 2010) we have E(X) = ∑N n,m=1 w + nm ‖xn − xm‖ 2 and E(X) = ∑N n,m=1 w − nme −‖xn−xm‖ 2 , whereW and W are symmetric nonnegative (with w nn = w nn = 0, n = 1, . . . , N ). Spectral methods such as Laplacian eigenmaps or LLE define E(X) = ∑N n,m=1 w + nm ‖xn − xm‖ 2 and E(X) = 0, with nonnegative affinitiesW, but add quadratic constraints to prevent the trivial solution X = 0. SoE is as in EE and SNE. This formulation suggests previously unexplored algorithms, such as using an Epanechnikov kernel, or a t-EE, or using homotopy algorithms for SNE/ t-SNE, where we follow the optimal pathX(λ) from λ = 0 (whereX = 0) to λ = 1. It can also be extended to closely related methods for embedding (kernel information embedding; Memisevic, 2006) and metric learning (neighborhood component analysis; Goldberger et al. , 2005), among others. We express the gradient and Hessian (written as matrices of d×N andNd×Nd, resp.) in terms of Laplacians, following Carreira-Perpĩ nán(2010), as opposed to the forms used in the SNE papers. This brings out the relation with spectral methods and simplifies the task of finding pd terms. Given anN ×N symmetric matrix of weightsW = (wnm), we define its graph Laplacian matrix as L = D − W where D = diag ( ∑N n=1 wnm) is the degree matrix. Likewise we getL fromw nm,L q fromw nm, etc.L is psd ifW is nonnegative (sinceuLu = 1 2 ∑N n,m=1 wnm(un − um) 2 ≥ 0). The Laplacians below always assume summation over points, so that the dimension-dependent Nd × Nd LaplacianL (from weightsw in,jm) is really anN ×N Laplacian for each(i, j) point dimension. All other Laplacians are dimension-independent, of N ×N . Using this convention, we have for normalized symmetric models:

[1]  Geoffrey E. Hinton,et al.  Improving dimensionality reduction with spectral gradient descent , 2005, Neural Networks.

[2]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[3]  Kilian Q. Weinberger,et al.  Unsupervised Learning of Image Manifolds by Semidefinite Programming , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[4]  Miguel Á. Carreira-Perpiñán,et al.  The Elastic Embedding Algorithm for Dimensionality Reduction , 2010, ICML.

[5]  Jarkko Venna,et al.  Nonlinear Dimensionality Reduction as Information Retrieval , 2007, AISTATS.

[6]  John W. Sammon,et al.  A Nonlinear Mapping for Data Structure Analysis , 1969, IEEE Transactions on Computers.

[7]  Geoffrey E. Hinton,et al.  Visualizing Similarity Data with a Mixture of Maps , 2007, AISTATS.

[8]  Ramani Duraiswami,et al.  The improved fast Gauss transform with applications to machine learning , 2005 .

[9]  Leslie Greengard,et al.  The Fast Gauss Transform , 1991, SIAM J. Sci. Comput..

[10]  Geoffrey E. Hinton,et al.  Stochastic Neighbor Embedding , 2002, NIPS.

[11]  Geoffrey E. Hinton,et al.  Neighbourhood Components Analysis , 2004, NIPS.

[12]  Mikhail Belkin,et al.  Laplacian Eigenmaps for Dimensionality Reduction and Data Representation , 2003, Neural Computation.

[13]  Jan de Leeuw,et al.  Modern Multidimensional Scaling: Theory and Applications (Second Edition) , 2005 .

[14]  Roland Memisevic,et al.  Kernel information embeddings , 2006, ICML.

[15]  S T Roweis,et al.  Nonlinear dimensionality reduction by locally linear embedding. , 2000, Science.