GIANT: Globally Improved Approximate Newton Method for Distributed Optimization

For distributed computing environments, we consider the empirical risk minimization problem and propose a distributed and communication-efficient Newton-type optimization method. At every iteration, each worker locally finds an Approximate NewTon (ANT) direction, which is sent to the main driver. The main driver then averages all the ANT directions received from the workers to form a Globally Improved ANT (GIANT) direction. GIANT is highly communication-efficient and naturally exploits the trade-off between local computation and global communication: more local computation results in fewer overall rounds of communication. Theoretically, we show that GIANT enjoys an improved convergence rate compared with first-order methods and existing distributed Newton-type methods. Further, in sharp contrast with many existing distributed Newton-type methods as well as popular first-order methods, GIANT has the highly advantageous practical feature of involving only one tuning parameter. We conduct large-scale experiments on a computer cluster and empirically demonstrate the superior performance of GIANT.
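
The iteration described above can be sketched compactly. Below is a minimal single-machine sketch, assuming a ridge-regularized logistic-regression objective and simulating the workers as row partitions of the data held in NumPy arrays; the fixed step size `alpha` and the dense solve of each local Newton system are illustrative stand-ins, not the paper's implementation (which solves the local systems inexactly, e.g. with conjugate gradient). The two communication rounds per iteration, gradient aggregation followed by direction averaging, reflect the structure suggested by the abstract.

```python
# Minimal sketch of a GIANT-style iteration (assumptions noted in the text above).
import numpy as np

def local_gradient(w, X_i, y_i, lam):
    # Gradient of the local logistic loss (labels y_i in {-1, +1}) plus l2 term.
    p = 1.0 / (1.0 + np.exp(-y_i * (X_i @ w)))
    return X_i.T @ (-(1.0 - p) * y_i) / len(y_i) + lam * w

def local_ant_direction(w, g, X_i, y_i, lam):
    # Approximate NewTon (ANT) direction: solve the local Newton system H_i p = g.
    p = 1.0 / (1.0 + np.exp(-y_i * (X_i @ w)))
    s = p * (1.0 - p)                                  # per-example curvature weights
    H_i = (X_i.T * s) @ X_i / len(y_i) + lam * np.eye(len(w))
    return np.linalg.solve(H_i, g)                     # stand-in for an inexact CG solve

def giant(partitions, d, lam=1e-3, alpha=1.0, iters=20):
    # partitions: list of (X_i, y_i) blocks, one per simulated worker.
    w = np.zeros(d)
    for _ in range(iters):
        # Communication round 1: aggregate local gradients into the global gradient.
        g = np.mean([local_gradient(w, X, y, lam) for X, y in partitions], axis=0)
        # Communication round 2: average the workers' ANT directions into the GIANT direction.
        p = np.mean([local_ant_direction(w, g, X, y, lam) for X, y in partitions], axis=0)
        w = w - alpha * p                              # alpha plays the role of the single tuning parameter
    return w

# Toy usage: 4 simulated workers, each holding 250 examples in 10 dimensions.
rng = np.random.default_rng(0)
w_true = rng.normal(size=10)
parts = []
for _ in range(4):
    X = rng.normal(size=(250, 10))
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=250))
    parts.append((X, y))
w_hat = giant(parts, d=10)
```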
