Efficient Use of Limited-Memory Resources to Accelerate Linear Learning

We propose a generic algorithmic building block to accelerate training of machine learning models on heterogeneous compute systems. The scheme allows compute accelerators such as GPUs and FPGAs to be employed efficiently for training large-scale machine learning models, even when the training data exceeds their memory capacity. It also adapts to any system's memory hierarchy in terms of size and processing speed. Our technique builds upon primal-dual coordinate methods and uses duality-gap information to dynamically decide which part of the data should be made available for fast processing. We give a strong theoretical motivation for our gap-based selection scheme and provide an efficient practical implementation of it. To illustrate the power of our approach, we demonstrate its performance for training generalized linear models on large-scale datasets that exceed the memory size of a modern GPU, showing an order-of-magnitude speedup over existing approaches.
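
To make the gap-based selection idea concrete, the following sketch (Python/NumPy, not the paper's implementation) applies duality-gap-based working-set selection to primal-dual coordinate ascent for ridge regression. The squared-loss setting, the working-set fraction `budget`, and all helper names are illustrative assumptions rather than details taken from the paper.

import numpy as np

def per_coordinate_gaps(A, y, alpha, w, lam):
    """Duality gap contribution of each coordinate for the squared loss.

    Primal:  min_w (1/n) * sum_i 0.5*(x_i^T w - y_i)^2 + (lam/2)*||w||^2
    Dual:    max_alpha -(1/n) * sum_i (0.5*alpha_i^2 - alpha_i*y_i)
                        - (lam/2)*||w(alpha)||^2
    with x_i the i-th row of A and w(alpha) = (1/(lam*n)) * sum_i alpha_i * x_i.
    gap_i = loss_i(x_i^T w) + conj_loss_i(-alpha_i) + alpha_i * (x_i^T w),
    and (1/n) * sum_i gap_i equals the duality gap.
    """
    margins = A @ w                       # x_i^T w for all i
    loss = 0.5 * (margins - y) ** 2       # primal losses
    conj = 0.5 * alpha ** 2 - alpha * y   # conjugate losses at -alpha_i
    return loss + conj + alpha * margins  # each term is >= 0 (Fenchel-Young)

def sdca_epoch(A, y, alpha, w, lam, working_set, rng):
    """One pass of dual coordinate ascent restricted to the working set."""
    n = A.shape[0]
    for i in rng.permutation(working_set):
        # Closed-form coordinate maximizer of the dual for the squared loss.
        delta = (y[i] - A[i] @ w - alpha[i]) / (1.0 + A[i] @ A[i] / (lam * n))
        alpha[i] += delta
        w += delta / (lam * n) * A[i]     # maintain w = A^T alpha / (lam*n)
    return alpha, w

def gap_based_training(A, y, lam=0.1, budget=0.25, outer_iters=20, seed=0):
    """Keep only the `budget` fraction of coordinates with the largest
    duality gaps in 'fast memory' and train on that subset each round."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    alpha = np.zeros(n)
    w = np.zeros(A.shape[1])
    k = max(1, int(budget * n))
    for _ in range(outer_iters):
        gaps = per_coordinate_gaps(A, y, alpha, w, lam)
        working_set = np.argsort(gaps)[-k:]   # largest-gap coordinates
        alpha, w = sdca_epoch(A, y, alpha, w, lam, working_set, rng)
        print(f"duality gap: {gaps.mean():.6f}")
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((1000, 50))
    y = A @ rng.standard_normal(50) + 0.1 * rng.standard_normal(1000)
    gap_based_training(A, y)

The per-coordinate gap terms are non-negative and sum (up to the 1/n factor) to the overall duality gap, which is what makes them a natural importance measure for deciding which coordinates to keep in the accelerator's limited memory in this simplified setting.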
