On Convergence of Model Parallel Proximal Gradient Algorithm for Stale Synchronous Parallel System

Theorem 1 (Asymptotic consistency). Let Assumptions 1 and 2 hold, and apply msPG to problem (P). If the step size satisfies η < (L_f + 2Ls)^{−1}, then the global model x(t) and the local models x̂(t) satisfy:

1. ∑_{t=0}^{∞} ‖x(t+1) − x(t)‖ < ∞;
2. lim_{t→∞} ‖x(t+1) − x(t)‖ = 0 and lim_{t→∞} ‖x(t) − x̂(t)‖ = 0;
3. the sets of limit points satisfy ω({x(t)}) = ω({x̂(t)}) ⊆ crit F.

Proof. We start by bounding the difference between the global model x and the local model x̂_i (on any machine i). Indeed, at iteration t, by the definition of the global and local models in msPG,

‖x(t) − x̂_i(t)‖ = √( ∑_{j=1}^{p} … )
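The step-size condition can be exercised numerically on a toy instance. The sketch below is not the paper's msPG implementation: it simulates staleness serially by evaluating the gradient at a randomly delayed iterate (delay bounded by s), on an assumed lasso instance f(x) = ½‖Ax − b‖², g(x) = λ‖x‖₁, and, as a further assumption, takes L = L_f in the bound η < (L_f + 2Ls)^{−1}. It records the successive differences ‖x(t+1) − x(t)‖, which shrink toward zero, consistent with claim 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 0.1          # l1 penalty weight
s = 3              # assumed staleness bound

# Lipschitz constant of grad f for f(x) = 0.5*||Ax - b||^2
Lf = np.linalg.norm(A, 2) ** 2
# step size just below the theorem's bound, taking L = Lf (an assumption)
eta = 0.9 / (Lf + 2 * Lf * s)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(p)
history = [x.copy()]   # past iterates, so we can read stale ones
diffs = []
for t in range(3000):
    # gradient evaluated at a stale iterate x(t - d), with 0 <= d <= s
    d = min(int(rng.integers(0, s + 1)), len(history) - 1)
    x_stale = history[-1 - d]
    grad = A.T @ (A @ x_stale - b)
    x = soft_threshold(x - eta * grad, eta * lam)
    history.append(x.copy())
    diffs.append(np.linalg.norm(history[-1] - history[-2]))

# successive differences shrink despite the stale gradients
print(diffs[0], diffs[-1])
```

With the step size chosen inside the bound, the successive differences decay even though every gradient may be up to s iterations stale; pushing η well above the bound makes the same loop diverge, which is one way to see why the (L_f + 2Ls)^{−1} ceiling matters.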
