On Convergence of Model Parallel Proximal Gradient Algorithm for Stale Synchronous Parallel System

Theorem 1 (Asymptotic consistency). Let Assumptions 1 and 2 hold, and apply msPG to problem (P). If the step size satisfies η < (L_f + 2Ls)^{−1}, then the global model x(t) and the local models x̂(t) satisfy:

1. ∑_{t=0}^{∞} ‖x(t+1) − x(t)‖ < ∞;
2. lim_{t→∞} ‖x(t+1) − x(t)‖ = 0 and lim_{t→∞} ‖x(t) − x̂(t)‖ = 0;
3. the sets of limit points satisfy ω({x(t)}) = ω({x̂(t)}) ⊆ crit F.

Proof. We start by bounding the difference between the global model x and the local model x̂_i (on any machine i). Indeed, at iteration t, by the definition of the global and local models in msPG,

‖x(t) − x̂_i(t)‖ = √( ∑_{j=1}^{p} … )
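The step-size condition can be exercised numerically on a toy instance. The sketch below is not the paper's msPG implementation: it simulates staleness serially by evaluating the gradient at a randomly delayed iterate (delay bounded by s), on an assumed lasso instance f(x) = ½‖Ax − b‖², g(x) = λ‖x‖₁, and, as a further assumption, takes L = L_f in the bound η < (L_f + 2Ls)^{−1}. It records the successive differences ‖x(t+1) − x(t)‖, which shrink toward zero, consistent with claim 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 20
A = rng.standard_normal((n, p))
b = rng.standard_normal(n)
lam = 0.1          # l1 penalty weight
s = 3              # assumed staleness bound

# Lipschitz constant of grad f for f(x) = 0.5*||Ax - b||^2
Lf = np.linalg.norm(A, 2) ** 2
# step size just below the theorem's bound, taking L = Lf (an assumption)
eta = 0.9 / (Lf + 2 * Lf * s)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

x = np.zeros(p)
history = [x.copy()]   # past iterates, so we can read stale ones
diffs = []
for t in range(3000):
    # gradient evaluated at a stale iterate x(t - d), with 0 <= d <= s
    d = min(int(rng.integers(0, s + 1)), len(history) - 1)
    x_stale = history[-1 - d]
    grad = A.T @ (A @ x_stale - b)
    x = soft_threshold(x - eta * grad, eta * lam)
    history.append(x.copy())
    diffs.append(np.linalg.norm(history[-1] - history[-2]))

# successive differences shrink despite the stale gradients
print(diffs[0], diffs[-1])
```

With the step size chosen inside the bound, the successive differences decay even though every gradient may be up to s iterations stale; pushing η well above the bound makes the same loop diverge, which is one way to see why the (L_f + 2Ls)^{−1} ceiling matters.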
