Unifying linear dimensionality reduction

Linear dimensionality reduction methods are a cornerstone of analyzing high-dimensional data, due to their simple geometric interpretations and typically attractive computational properties. These methods capture many data features of interest, such as covariance, dynamical structure, correlation between data sets, input-output relationships, and margin between data classes. Methods have been developed under a variety of names and motivations in many fields, and perhaps as a result the deeper connections among them have not been understood. Here we unify methods from this disparate literature as optimization programs over matrix manifolds. We discuss principal component analysis, factor analysis, linear multidimensional scaling, Fisher's linear discriminant analysis, canonical correlations analysis, maximum autocorrelation factors, slow feature analysis, undercomplete independent component analysis, linear regression, and more. This optimization framework helps elucidate some rarely discussed shortcomings of well-known methods, such as the suboptimality of certain eigenvector solutions. Modern techniques for optimization over matrix manifolds enable a generic linear dimensionality reduction solver, which accepts data and an objective to be optimized as input and returns an optimal low-dimensional projection of the data as output. This optimization framework further allows rapid development of novel variants of classical methods, which we demonstrate here by creating an orthogonal-projection canonical correlations analysis. More broadly, we suggest that our generic linear dimensionality reduction solver can move linear dimensionality reduction toward becoming a black-box, objective-agnostic numerical technology.
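
To make the idea of a generic, objective-agnostic solver concrete, the following is a minimal sketch, not the authors' solver. It assumes the projection is a d x r matrix M with orthonormal columns, and it uses plain gradient ascent with a QR-based retraction, which is only one of several possible choices. The names `stiefel_ascent` and `pca_grad`, the fixed step size, and the synthetic data are all hypothetical and chosen for illustration; the objective shown is the PCA objective trace(M^T C M).

```python
import numpy as np

def stiefel_ascent(grad_f, d, r, lr, steps=500, seed=0):
    """Maximize an objective over d x r matrices with orthonormal columns.

    grad_f(M) must return the ordinary (Euclidean) gradient of the objective.
    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # Random feasible starting point: orthonormalize a Gaussian matrix.
    M, _ = np.linalg.qr(rng.standard_normal((d, r)))
    for _ in range(steps):
        G = grad_f(M)
        # Project the Euclidean gradient onto the tangent space at M.
        xi = G - M @ ((M.T @ G + G.T @ M) / 2)
        # Take a step, then retract onto the constraint set via QR.
        Q, R = np.linalg.qr(M + lr * xi)
        # Fix column signs so the retraction stays close to M.
        M = Q * np.sign(np.diag(R))
    return M

# Illustrative objective: PCA as maximizing trace(M^T C M).
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)
C = X.T @ X / X.shape[0]          # sample covariance

def pca_grad(M):
    return 2 * C @ M              # gradient of trace(M^T C M)

lr = 0.25 / np.linalg.norm(C, 2)  # step size scaled by the spectral norm of C
M_pca = stiefel_ascent(pca_grad, d=5, r=2, lr=lr)

captured = np.trace(M_pca.T @ C @ M_pca)         # variance kept by the projection
top2 = np.sort(np.linalg.eigvalsh(C))[-2:].sum() # eigenvector benchmark
```

In this sketch only `pca_grad` is specific to PCA. Swapping in the gradient of a different objective would, under the same orthonormality assumption, reuse the solver loop unchanged, which is the sense in which such a solver is objective-agnostic. For PCA the result can be checked against the eigenvector solution by comparing `captured` with `top2`.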
