A distributed Frank–Wolfe framework for learning low-rank matrices with the trace norm

We consider the problem of learning a high-dimensional but low-rank matrix from a large-scale dataset distributed over several machines, where low-rankness is enforced by a convex trace norm constraint. We propose DFW-Trace, a distributed Frank–Wolfe algorithm that leverages the low-rank structure of its updates to achieve efficiency in time, memory, and communication. The linear subproblem at the heart of each DFW-Trace iteration is solved approximately using a distributed version of the power method. We provide a theoretical analysis of the convergence of DFW-Trace, showing that sublinear convergence in expectation to an optimal solution can be ensured with few power iterations per epoch. We implement DFW-Trace in the Apache Spark distributed programming framework and validate the usefulness of our approach on synthetic and real data, including the ImageNet dataset with high-dimensional features extracted from a deep neural network.
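
To make the structure of such an update concrete, the following is a minimal single-machine sketch in NumPy, not the DFW-Trace implementation itself. It assumes a smooth objective whose gradient is supplied by a hypothetical callable `grad`, a trace norm radius `mu`, the standard step size 2/(t+2), and a plain (non-distributed) power method; the function names `top_singular_pair` and `frank_wolfe_trace` are illustrative. Over the trace norm ball, the Frank–Wolfe linear subproblem is minimized by a rank-one matrix built from the leading singular vectors of the gradient, which is why a few power iterations suffice to approximate it.

```python
import numpy as np


def top_singular_pair(G, n_iter=50, seed=0):
    """Approximate the leading singular triplet of G by power iteration on G^T G."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = G.T @ (G @ v)            # one power iteration step
        norm = np.linalg.norm(w)
        if norm == 0.0:              # G is (numerically) zero: nothing to refine
            break
        v = w / norm
    u = G @ v
    sigma = np.linalg.norm(u)        # estimate of the largest singular value
    if sigma > 0.0:
        u = u / sigma
    return u, sigma, v


def frank_wolfe_trace(grad, shape, mu, n_epochs=100, n_power_iter=50):
    """Frank-Wolfe over the trace norm ball {W : ||W||_* <= mu} (illustrative sketch)."""
    W = np.zeros(shape)
    for t in range(n_epochs):
        G = grad(W)
        u, _, v = top_singular_pair(G, n_power_iter, seed=t)
        S = -mu * np.outer(u, v)     # rank-one (approximate) minimizer of <S, G>
        gamma = 2.0 / (t + 2.0)      # standard open-loop step size
        W = (1.0 - gamma) * W + gamma * S
    return W


if __name__ == "__main__":
    # Toy problem (assumption): recover a rank-one target M under a squared loss.
    rng = np.random.default_rng(1)
    a = rng.standard_normal(8)
    b = rng.standard_normal(5)
    M = np.outer(a / np.linalg.norm(a), b / np.linalg.norm(b))
    W = frank_wolfe_trace(lambda W: W - M, M.shape, mu=1.0, n_epochs=200)
    print("objective:", 0.5 * np.linalg.norm(W - M) ** 2)
    print("trace norm:", np.linalg.norm(W, ord="nuc"))
```

Each iterate is a convex combination of rank-one matrices, so it can be stored in factored form rather than as a dense matrix. In a distributed setting, the matrix-vector products inside `top_singular_pair` are the natural place to aggregate contributions from the machines holding the data.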
