Distributing Frank–Wolfe via map-reduce

Large-scale optimization problems abound in data mining and machine learning applications, and the computational challenges they pose are often addressed through parallelization. We identify structural properties under which a convex optimization problem can be massively parallelized via map-reduce operations using the Frank–Wolfe (FW) algorithm. The class of problems that can be tackled this way is quite broad and includes experimental design, AdaBoost, and projection to a convex hull. Implementing FW via map-reduce eases parallelization and deployment via commercial distributed computing frameworks. We demonstrate this by implementing FW over Spark, an engine for parallel data processing, and establish that parallelization through map-reduce yields significant performance improvements: We solve problems with 20 million variables using 350 cores in 79 min; the same operation takes 48 h when executed serially.

[1]  Douglas Stott Parker,et al.  Map-reduce-merge: simplified relational data processing on large clusters , 2007, SIGMOD '07.

[2]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[3]  Alexander J. Smola,et al.  Parallelized Stochastic Gradient Descent , 2010, NIPS.

[4]  Yi Zhou,et al.  Conditional Gradient Sliding for Convex Optimization , 2016, SIAM J. Optim..

[5]  Parikshit Shah,et al.  Linear system identification via atomic norm regularization , 2012, 2012 IEEE 51st IEEE Conference on Decision and Control (CDC).

[6]  Dinh Phung,et al.  Journal of Machine Learning Research: Preface , 2014 .

[7]  Gang Wang,et al.  Randomized Block Frank–Wolfe for Convergent Large-Scale Learning , 2016, IEEE Transactions on Signal Processing.

[8]  Philip Wolfe,et al.  An algorithm for quadratic programming , 1956 .

[9]  Zaïd Harchaoui,et al.  Lifted coordinate descent for learning with trace-norm regularization , 2012, AISTATS.

[10]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1997 .

[11]  Maria-Florina Balcan,et al.  A Distributed Frank-Wolfe Algorithm for Communication-Efficient Sparse Learning , 2014, SDM.

[12]  Ofer Meshi,et al.  Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes , 2016, NIPS.

[13]  J. Dunn Rates of convergence for conditional gradient algorithms near singular and nonsingular extremals , 1979, 1978 IEEE Conference on Decision and Control including the 17th Symposium on Adaptive Processes.

[14]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[15]  Shimrit Shtern,et al.  Linearly convergent away-step conditional gradient for non-strongly convex functions , 2015, Mathematical Programming.

[16]  Elad Hazan,et al.  Projection-free Online Learning , 2012, ICML.

[17]  Patrice Marcotte,et al.  Some comments on Wolfe's ‘away step’ , 1986, Math. Program..

[18]  Pradeep Ravikumar,et al.  Greedy Algorithms for Structurally Constrained High Dimensional Problems , 2011, NIPS.

[19]  Sergei Vassilvitskii,et al.  Counting triangles and the curse of the last reducer , 2011, WWW.

[20]  Stratis Ioannidis,et al.  Distributing Frank-Wolfe via Map-Reduce , 2017, 2017 IEEE International Conference on Data Mining (ICDM).

[21]  M. I. Rosenberg,et al.  Naval Research Logistics Quarterly. , 1958 .

[22]  Kunle Olukotun,et al.  Map-Reduce for Machine Learning on Multicore , 2006, NIPS.

[23]  Yijie Wang,et al.  Stochastic block coordinate Frank-Wolfe algorithm for large-scale biological network alignment , 2016, EURASIP J. Bioinform. Syst. Biol..

[24]  Matthijs Douze,et al.  Large-scale image classification with trace-norm regularization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Andreas Krause,et al.  Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains , 2016, AISTATS.

[26]  A. Ng Feature selection, L1 vs. L2 regularization, and rotational invariance , 2004, Twenty-first international conference on Machine learning - ICML '04.

[27]  Anton Osokin,et al.  Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs , 2016, ICML.

[28]  E T. Leighton,et al.  Introduction to parallel algorithms and architectures , 1991 .

[29]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[30]  F. Leighton,et al.  Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes , 1991 .

[31]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[32]  Eric P. Xing,et al.  Parallel and Distributed Block-Coordinate Frank-Wolfe Algorithms , 2014, ICML.

[33]  F. Maxwell Harper,et al.  The MovieLens Datasets: History and Context , 2016, TIIS.

[34]  Mark W. Schmidt,et al.  Block-Coordinate Frank-Wolfe Optimization for Structural SVMs , 2012, ICML.

[35]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[36]  Tianbao Yang,et al.  Trading Computation for Communication: Distributed Stochastic Dual Coordinate Ascent , 2013, NIPS.

[37]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[38]  Sergei Vassilvitskii,et al.  Scalable K-Means++ , 2012, Proc. VLDB Endow..

[39]  J. Sherman,et al.  Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix , 1950 .

[40]  Peng Li,et al.  Distance Metric Learning with Eigenvalue Optimization , 2012, J. Mach. Learn. Res..

[41]  Fei-Fei Li,et al.  Efficient Image and Video Co-localization with Frank-Wolfe Algorithm , 2014, ECCV.

[42]  Martin Jaggi,et al.  Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization , 2013, ICML.

[43]  Pablo A. Parrilo,et al.  The Convex Geometry of Linear Inverse Problems , 2010, Foundations of Computational Mathematics.

[44]  M. Canon,et al.  A Tight Upper Bound on the Rate of Convergence of Frank-Wolfe Algorithm , 1968 .

[45]  Arindam Banerjee,et al.  Structured Estimation with Atomic Norms: General Bounds and Applications , 2015, NIPS.

[46]  Alexander J. Smola,et al.  Stochastic Frank-Wolfe methods for nonconvex optimization , 2016, 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[47]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[48]  Sergei Vassilvitskii,et al.  Fast greedy algorithms in mapreduce and streaming , 2013, SPAA.

[49]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[50]  Jan Vondrák,et al.  Maximizing a Monotone Submodular Function Subject to a Matroid Constraint , 2011, SIAM J. Comput..

[51]  Kenneth L. Clarkson,et al.  Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm , 2008, SODA '08.

[52]  Yehuda Koren,et al.  Matrix Factorization Techniques for Recommender Systems , 2009, Computer.

[53]  Haipeng Luo,et al.  Variance-Reduced and Projection-Free Stochastic Optimization , 2016, ICML.

[54]  DAN GARBER,et al.  A Linearly Convergent Variant of the Conditional Gradient Algorithm under Strong Convexity, with Applications to Online and Stochastic Optimization , 2016, SIAM J. Optim..

[55]  Nam-Luc Tran,et al.  Distributed frank-wolfe under pipelined stale synchronous parallelism , 2015, 2015 IEEE International Conference on Big Data (Big Data).