Bundle CDN: A Highly Parallelized Approach for Large-Scale ℓ1-Regularized Logistic Regression

Parallel coordinate descent algorithms have emerged in response to the growing demand for large-scale optimization. Previous algorithms, however, are typically limited by divergence under a high degree of parallelism (DOP), or require data pre-processing to avoid divergence. To better exploit parallelism, we propose a coordinate-descent-based parallel algorithm that requires no data pre-processing, termed Bundle Coordinate Descent Newton (BCDN), and apply it to large-scale ℓ1-regularized logistic regression. BCDN randomly partitions the feature set into Q non-overlapping subsets (bundles) of P features each and processes the bundles in a Gauss-Seidel manner. For each bundle, it computes the descent directions for the P features in parallel and performs a P-dimensional Armijo line search to obtain the step size. Through a theoretical analysis of global convergence, we show that BCDN is guaranteed to converge even at a high DOP. Experimental evaluations on five public datasets show that BCDN better exploits parallelism and outperforms state-of-the-art algorithms in speed without losing test accuracy.
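Below is a minimal, single-process NumPy sketch of one outer pass as the abstract describes it: random bundles of P features, per-feature Newton directions computed from the same iterate, and one P-dimensional Armijo line search per bundle. The function names (bcdn_epoch, _newton_direction), the dense-matrix representation, and the default constants (beta, sigma, the step-size floor) are illustrative assumptions rather than the paper's implementation, and the per-feature loop is written serially where BCDN evaluates the directions in parallel.

```python
import numpy as np

def _sigmoid(z):
    z = np.clip(z, -30.0, 30.0)          # avoid overflow in exp
    return 1.0 / (1.0 + np.exp(-z))

def _objective(X, y, w, C):
    """F(w) = ||w||_1 + C * sum_i log(1 + exp(-y_i x_i^T w))."""
    margins = y * (X @ w)
    return np.abs(w).sum() + C * np.logaddexp(0.0, -margins).sum()

def _newton_direction(xj, y, p, w_j, C):
    """Closed-form 1-D Newton step for one feature of the l1-regularized
    logistic loss (the soft-threshold rule used by CDN-style solvers)."""
    g = C * np.dot((p - 1.0) * y, xj)                # smooth-loss gradient
    h = C * np.dot(xj * xj, p * (1.0 - p)) + 1e-12   # Hessian diagonal
    if g + 1.0 <= h * w_j:
        return -(g + 1.0) / h
    if g - 1.0 >= h * w_j:
        return -(g - 1.0) / h
    return -w_j

def bcdn_epoch(X, y, w, C, P, beta=0.5, sigma=0.01, seed=None):
    """One pass over all features: random bundles of P features processed
    in a Gauss-Seidel order; per-bundle directions + Armijo line search."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])
    for start in range(0, order.size, P):
        bundle = order[start:start + P]
        margins = y * (X @ w)
        p = _sigmoid(margins)
        # Directions for all P features of the bundle use the same iterate w,
        # so this loop is the part BCDN runs in parallel (serial here).
        d = np.zeros_like(w)
        for j in bundle:
            d[j] = _newton_direction(X[:, j], y, p, w[j], C)
        # P-dimensional Armijo line search on the nonsmooth objective.
        grad = C * (X.T @ ((p - 1.0) * y))
        delta = grad @ d + np.abs(w + d).sum() - np.abs(w).sum()
        f_old, step = _objective(X, y, w, C), 1.0
        while _objective(X, y, w + step * d, C) - f_old > sigma * step * delta:
            step *= beta
            if step < 1e-10:
                break
        w = w + step * d
    return w
```

Repeatedly calling, say, w = bcdn_epoch(X, y, np.zeros(X.shape[1]), C=1.0, P=64) until the objective stalls mimics the outer loop. The design point the sketch illustrates is that all P directions in a bundle are derived from the same iterate and then accepted through a single shared line search, which is what lets them be computed concurrently while guarding against the divergence that fully independent per-coordinate updates can cause at a high DOP.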
