Isometry and convexity in dimensionality reduction

In this dissertation we study two state-of-the-art dimensionality reduction methods: Maximum Variance Unfolding (MVU) and Non-Negative Matrix Factorization (NMF). Neither fits under the umbrella of Kernel PCA (KPCA). MVU is cast as a semidefinite program, a modern class of convex nonlinear optimization problems that offers more flexibility and power than KPCA. Although MVU and NMF appear to be disconnected problems, we show that they are connected: both are special cases of a general nonlinear factorization algorithm that we develop.

Two aspects of these algorithms are of particular interest: computational complexity and interpretability. Computational complexity answers the question of how fast we can find the best solution of MVU/NMF for large data volumes. Since we are dealing with optimization programs, we seek the global optimum, which is strongly connected with the convexity of the problem. Interpretability is strongly connected with local isometry (preservation of local distances), which gives meaning to relationships between data points; another aspect of interpretability is the association of data with labeled information.

The contributions of this thesis are the following:
(1) MVU is modified so that it scales more efficiently; results are shown on a speech dataset of 1 million points, and the limitations of the method are highlighted.
(2) An algorithm for fast computation of furthest neighbors is presented for the first time in the literature.
(3) Construction of optimal kernels for Kernel Density Estimation (KDE) with modern convex programming is presented; for the first time we show that the Leave-One-Out Cross-Validation (LOOCV) function is quasi-concave.
(4) For the first time, NMF is formulated as a convex optimization problem.
(5) An algorithm for the problem of Completely Positive Matrix Factorization is presented.
(6) A hybrid of MVU and NMF, isoNMF, is presented, combining the advantages of both methods.
(7) Isometric Separation Maps (ISM), a variation of MVU that incorporates classification information, is presented.
(8) Large-scale nonlinear dimensionality reduction is performed on the TIMIT speech database.
(9) A general nonlinear factorization algorithm based on sequential convex programming is presented.

Despite our efforts to scale the proposed methods up to 1 million data points in reasonable time, the gap between industrial demand and the current state of the art remains orders of magnitude wide.
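Contribution (2) accelerates furthest-neighbor search. As a reference point for what such an algorithm improves upon, the brute-force baseline can be sketched in a few lines of numpy (an illustrative sketch only, not the tree-based method of the thesis):

```python
import numpy as np

def furthest_neighbors(X):
    """Brute-force O(n^2) furthest neighbor of every point: the baseline
    that space-partitioning trees are designed to beat."""
    # all pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmax(axis=1)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [5.0, 0.0]])
idx = furthest_neighbors(X)  # idx == [2, 2, 0]
```

The quadratic cost of this baseline is exactly what makes a fast furthest-neighbor algorithm necessary at the million-point scale discussed above.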
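Contribution (3) optimizes kernels for KDE via the LOOCV objective. As context, a minimal numpy sketch of the leave-one-out log-likelihood for a one-dimensional Gaussian KDE, scanned over a bandwidth grid (illustrative only; the thesis optimizes over full kernels with convex programming rather than a scalar bandwidth):

```python
import numpy as np

def loo_log_likelihood(x, h):
    """Leave-one-out log-likelihood of a 1-D Gaussian KDE with bandwidth h."""
    n = len(x)
    d = x[:, None] - x[None, :]                       # pairwise differences
    K = np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2 * np.pi))
    np.fill_diagonal(K, 0.0)                          # leave each point out
    f = K.sum(axis=1) / (n - 1)                       # LOO density at each x_i
    return np.log(f).sum()

rng = np.random.default_rng(0)
x = rng.normal(size=200)                              # toy standard-normal sample
grid = np.linspace(0.05, 2.0, 40)
scores = np.array([loo_log_likelihood(x, h) for h in grid])
h_star = grid[scores.argmax()]                        # LOOCV-selected bandwidth
```

Maximizing this objective selects the bandwidth; the quasi-concavity result stated above is what guarantees that such a maximizer can be found reliably.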
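Contribution (4) reformulates NMF as a convex program. For contrast, the classical non-convex approach it departs from, the multiplicative updates of Lee and Seung, can be sketched as follows (a minimal illustration of standard NMF, not the convex formulation of the thesis):

```python
import numpy as np

def nmf(V, rank, iters=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates: V ~ W @ H with W, H >= 0.
    Nonnegativity is preserved because every update is a ratio of
    nonnegative quantities."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# toy nonnegative matrix with exact rank-1 structure
V = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0]])
W, H = nmf(V, rank=1)
err = np.linalg.norm(V - W @ H)
```

Because the objective is non-convex in (W, H) jointly, these updates only guarantee a local optimum, which is precisely the limitation the convex reformulation above addresses.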
