Paper information - Distributing Frank–Wolfe via map-reduce

Distributing Frank–Wolfe via map-reduce

Large-scale optimization problems abound in data mining and machine learning applications, and the computational challenges they pose are often addressed through parallelization. We identify structural properties under which a convex optimization problem can be massively parallelized via map-reduce operations using the Frank–Wolfe (FW) algorithm. The class of problems that can be tackled this way is quite broad and includes experimental design, AdaBoost, and projection to a convex hull. Implementing FW via map-reduce eases parallelization and deployment via commercial distributed computing frameworks. We demonstrate this by implementing FW over Spark, an engine for parallel data processing, and establish that parallelization through map-reduce yields significant performance improvements: We solve problems with 20 million variables using 350 cores in 79 min; the same operation takes 48 h when executed serially.
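To make the idea in the abstract concrete, here is a minimal sketch of one of the listed problems, projection onto a convex hull, solved with Frank–Wolfe so that each iteration is a map over data partitions followed by a reduce. It is not the authors' Spark implementation. It runs in plain Python on one machine, and the names `frank_wolfe_hull`, `local_min`, and `n_parts`, the step size 2/(k+2), and the round-robin partitioning are all assumptions made for illustration.

```python
from functools import reduce

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def local_min(partition, r):
    """Map step (hypothetical helper): for one partition of (index, point)
    pairs, return the (gradient entry, index) pair with the smallest entry.
    For f(theta) = ||A theta - p||^2 the i-th gradient entry is 2 * a_i . r,
    where r = A theta - p is a small vector shared with every partition."""
    return min((2.0 * dot(a, r), i) for i, a in partition)

def frank_wolfe_hull(points, p, n_parts=2, iters=200):
    """Approximately project p onto the convex hull of `points`.
    Returns the simplex weights theta and the residual r = A theta - p."""
    indexed = list(enumerate(points))
    # Assumed round-robin split of the variables across n_parts "workers".
    partitions = [indexed[k::n_parts] for k in range(n_parts)]
    theta = [0.0] * len(points)
    theta[0] = 1.0  # start at a vertex of the simplex
    r = [a - b for a, b in zip(points[0], p)]
    for k in range(iters):
        # Map: each partition reports its best coordinate.
        # Reduce: keep the overall minimum, which solves the linear
        # subproblem over the simplex.
        _, i_star = reduce(min, map(lambda part: local_min(part, r), partitions))
        gamma = 2.0 / (k + 2.0)  # standard diminishing step size
        theta = [(1.0 - gamma) * t for t in theta]
        theta[i_star] += gamma
        # Update the shared residual without touching the other points.
        r = [(1.0 - gamma) * rj + gamma * (aj - pj)
             for rj, aj, pj in zip(r, points[i_star], p)]
    return theta, r
```

For example, `frank_wolfe_hull([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], (1.0, 1.0))` returns weights concentrated on the last two points, so the projected point lies near (0.5, 0.5). The sketch keeps only the low-dimensional residual `r` as shared state, which is the kind of structure that lets the per-coordinate work be split across workers.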

Stratis Ioannidis | Armin Moharrer
