Parallel Coordinate Descent Newton Method for Efficient $L_1$-Regularized Loss Minimization

Recent years have witnessed advances in parallel algorithms for large-scale optimization problems. Despite this demonstrated success, existing algorithms that parallelize over features typically suffer from divergence under high parallelism or require data preprocessing to alleviate the problem. In this paper, we propose a Parallel Coordinate Descent algorithm using *approximate* Newton steps (PCDN) that is guaranteed to converge globally without data preprocessing. The key component of PCDN is a high-dimensional line search, which guarantees global convergence even at high degrees of parallelism. PCDN randomly partitions the feature set into $b$ bundles of size $P$ and processes the bundles sequentially: for each bundle, it first computes the descent direction for each feature in parallel and then performs a $P$-dimensional line search to determine the step size. We show that: 1) PCDN is guaranteed to converge globally regardless of the degree of parallelism, and 2) PCDN reaches a specified accuracy $\epsilon$ within a bounded number of iterations $T_\epsilon$, where $T_\epsilon$ decreases as parallelism increases. In addition, the data transfer and synchronization cost of the $P$-dimensional line search can be minimized by maintaining intermediate quantities. For concreteness, we apply PCDN to $L_1$-regularized logistic regression and $L_1$-regularized $L_2$-loss support vector machine problems. Experimental evaluations on seven benchmark data sets show that PCDN exploits parallelism well and outperforms state-of-the-art methods.
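
To make the bundle-then-line-search structure concrete, the following Python code is a minimal, single-threaded sketch (not the authors' implementation) of a PCDN-style pass for $L_1$-regularized logistic regression. The per-feature Newton directions inside a bundle are the part PCDN computes in parallel, and the Armijo backtracking over the whole bundle plays the role of the $P$-dimensional line search; all function and parameter names (e.g., `pcdn_l1_logreg`, `bundle_size`, `sigma`) are illustrative assumptions.

```python
import numpy as np

def pcdn_l1_logreg(X, y, lam, bundle_size=8, n_epochs=20, beta=0.5, sigma=0.01):
    """Sketch of PCDN-style passes for L1-regularized logistic regression.

    X: (n_samples, n_features) data matrix; y: labels in {-1, +1}.
    Per bundle: compute a coordinate-wise (approximate Newton) direction for
    each feature independently, then run a single backtracking line search
    over the whole P-dimensional bundle direction.
    """
    n, d = X.shape
    w = np.zeros(d)
    Xw = X @ w  # cached margins; maintained incrementally to limit data transfer

    def objective(Xw_, w_):
        # logistic loss plus L1 penalty
        return np.sum(np.log1p(np.exp(-y * Xw_))) + lam * np.abs(w_).sum()

    for _ in range(n_epochs):
        features = np.random.permutation(d)
        for start in range(0, d, bundle_size):
            bundle = features[start:start + bundle_size]

            # First/second derivatives of the logistic loss w.r.t. the margin Xw
            p = 1.0 / (1.0 + np.exp(-y * Xw))   # p_i = sigma(y_i * x_i^T w)
            g_margin = -y * (1.0 - p)           # dL/d(Xw)
            h_margin = p * (1.0 - p)            # d2L/d(Xw)^2

            # Coordinate-wise Newton directions with soft-thresholding.
            # This loop is the part PCDN executes in parallel over the bundle.
            delta = np.zeros(d)
            for j in bundle:
                g = X[:, j] @ g_margin
                h = (X[:, j] ** 2) @ h_margin + 1e-12
                z = w[j] - g / h
                delta[j] = np.sign(z) * max(abs(z) - lam / h, 0.0) - w[j]

            # P-dimensional Armijo backtracking line search on the bundle.
            Xd = X[:, bundle] @ delta[bundle]
            f0 = objective(Xw, w)
            decrease = g_margin @ Xd + lam * (np.abs(w + delta).sum()
                                              - np.abs(w).sum())
            alpha = 1.0
            while objective(Xw + alpha * Xd, w + alpha * delta) > f0 + sigma * alpha * decrease:
                alpha *= beta
                if alpha < 1e-10:
                    alpha = 0.0
                    break

            w += alpha * delta
            Xw += alpha * Xd
    return w
```

In this sketch the cached margins `Xw` and the bundle product `Xd` are the intermediate quantities whose maintenance, as noted above, keeps the data transfer and synchronization cost of the line search low; a parallel implementation would distribute the per-feature direction computation across workers and synchronize only these vectors.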
