Geometric optimization algorithms for linear regression on fixed-rank matrices

Large and rapidly evolving data sets are now commonplace in modern applications. Efficiently mining and exploiting these data sets yields valuable information and is therefore an important challenge in domains such as network security, computer vision, internet search, bioinformatics, marketing, online advertising, and social networks, to name a few. The rapid development of these applications sustains an ever-increasing demand for efficient machine learning algorithms that can cope with large-scale problems, characterized by a large number of samples and a large number of variables.

The research reported in this thesis is devoted to the design of efficient machine learning algorithms for large-scale problems. Specifically, we adopt a geometric optimization viewpoint to address the problem of linear regression in nonlinear, high-dimensional matrix search spaces. Our purpose is to efficiently exploit the geometric structure of the search space in the design of scalable linear regression algorithms. Our search space of main interest is the set of low-rank matrices. Learning a low-rank matrix is a typical approach to high-dimensional problems: the low-rank constraint forces the learning algorithm to capture a limited number of dominant factors that mostly influence the sought solution. We consider the learning of both a fixed-rank symmetric positive semidefinite matrix and a fixed-rank non-symmetric matrix.

A first contribution of the thesis is to show that many modern machine learning problems can be formulated as linear regression problems on the set of fixed-rank matrices. For example, learning a low-rank distance, low-rank matrix completion, and learning on data pairs are all cast into the considered linear regression framework. For these problems, the low-rank constraint is either part of the original problem formulation or a sound approximation that significantly reduces the original problem size and complexity, resulting in a dramatic decrease in the computational cost of the algorithms.

Our main contribution is the development of novel, efficient algorithms for learning a linear regression model parameterized by a fixed-rank matrix. The resulting algorithms preserve the underlying geometric structure of the problem, scale to high-dimensional problems, enjoy local convergence properties, and confer a geometric basis to recent contributions on learning fixed-rank matrices. We thereby show that geometric optimization offers a solid and versatile framework for the design of rank-constrained machine learning algorithms. The efficiency of the proposed algorithms is illustrated on several machine learning applications; numerical experiments suggest that they compete favorably with the state of the art in terms of both achieved performance and required computational time.
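To make the fixed-rank regression setting concrete, the sketch below fits a rank-r symmetric positive semidefinite matrix W to trace measurements, i.e., it seeks W of rank r such that trace(W X_i) ≈ y_i. It uses plain gradient descent on the factor G of the parameterization W = G Gᵀ, which keeps every iterate exactly rank-r PSD. This is a minimal illustration under assumed choices, not the thesis's algorithm: the squared loss, the function name, the step size, and the Gaussian synthetic data are all illustrative, and the update is a simplified Euclidean-gradient variant of the factored approach rather than a Riemannian method.

```python
import numpy as np

def trace_regression_fixed_rank(Xs, y, n, r, step=0.1, iters=1000, seed=0):
    """Fit a rank-r PSD matrix W = G @ G.T so that trace(W @ X_i) ~ y_i.

    Plain gradient descent on the factor G; the factorization keeps every
    iterate exactly rank-r positive semidefinite. Minimizes the assumed loss
    f(G) = (1 / 2m) * sum_i (trace(G G^T X_i) - y_i)^2.
    """
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((n, r)) / np.sqrt(n)
    m = len(Xs)
    for _ in range(iters):
        # Residuals e_i = trace(G G^T X_i) - y_i, via the cheaper r x r trace.
        e = np.array([np.trace(G.T @ X @ G) for X in Xs]) - y
        # Euclidean gradient with respect to G: (1/m) sum_i e_i (X_i + X_i^T) G.
        grad = sum(ei * (X + X.T) @ G for ei, X in zip(e, Xs)) / m
        G -= step * grad
    return G @ G.T

# Tiny synthetic check: recover a random rank-3 PSD matrix from noiseless
# trace measurements (all problem sizes here are arbitrary).
if __name__ == "__main__":
    n, r, m = 20, 3, 300
    rng = np.random.default_rng(1)
    L = rng.standard_normal((n, r)) / np.sqrt(n)
    W_true = L @ L.T
    Xs = [rng.standard_normal((n, n)) for _ in range(m)]
    y = np.array([np.trace(W_true @ X) for X in Xs])
    W_hat = trace_regression_fixed_rank(Xs, y, n, r)
    print("relative error:",
          np.linalg.norm(W_hat - W_true) / np.linalg.norm(W_true))
```

The factorization W = G Gᵀ is invariant under G → GO for any orthogonal O, which is precisely the quotient structure the thesis's geometric algorithms exploit: they replace the Euclidean update above with a Riemannian gradient step that respects this invariance, which is what yields the scalability and local convergence properties claimed in the abstract.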
