DiSCO: Distributed Optimization for Self-Concordant Empirical Loss

We propose a new distributed algorithm for empirical risk minimization in machine learning. The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method. We analyze its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions, and discuss the results for distributed ridge regression, logistic regression, and binary classification with a smoothed hinge loss. In a standard setting for supervised learning, where the n data points are sampled i.i.d. and the regularization parameter scales as 1/√n, we show that the proposed algorithm is communication efficient: the required number of communication rounds does not increase with the sample size n and grows only slowly with the number of machines.

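The inner step of the method is to compute an inexact Newton direction by solving the system H v = ∇f(w) with a preconditioned conjugate gradient (PCG) iteration, where each Hessian-vector product is aggregated across machines (one round of communication per product). The sketch below is a minimal, single-process illustration of such a PCG inner loop; the callables `hessian_vec` and `precond_solve`, the tolerance, and the iteration cap are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def preconditioned_cg(hessian_vec, grad, precond_solve, tol=1e-6, max_iter=100):
    """Approximately solve H v = grad by preconditioned conjugate gradient.

    hessian_vec(p):  returns H @ p (in a distributed setting, the sum of local
                     Hessian-vector products, aggregated in one communication round).
    precond_solve(r): returns P^{-1} @ r for a preconditioner P built locally
                      (e.g., from the data held on one machine).
    """
    v = np.zeros_like(grad)
    r = grad.copy()              # residual of H v = grad, since v starts at 0
    s = precond_solve(r)
    p = s.copy()
    rs_old = r @ s
    for _ in range(max_iter):
        Hp = hessian_vec(p)
        alpha = rs_old / (p @ Hp)
        v += alpha * p
        r -= alpha * Hp
        if np.linalg.norm(r) <= tol:     # inexact stopping criterion
            break
        s = precond_solve(r)
        rs_new = r @ s
        p = s + (rs_new / rs_old) * p
        rs_old = rs_new
    return v
```

Because each call to `hessian_vec` costs one communication round, the number of PCG iterations per Newton step, rather than the per-machine computation, is what the communication analysis controls.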