Fault Tolerance in Iterative-Convergent Machine Learning

Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is well understood only for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific types of training algorithms, such as stochastic gradient descent. In this paper, we develop a general framework for quantifying the effects of calculation errors on iterative-convergent algorithms and use this framework to design new strategies for checkpoint-based fault tolerance. Our framework yields a worst-case upper bound on the iteration cost of arbitrary perturbations to model parameters during training. Our system, SCAR, employs strategies that reduce this upper bound for the perturbations incurred when recovering from checkpoints. We show that SCAR reduces the iteration cost of partial failures by 78% to 95% compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms.
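To make the notion of iteration cost concrete, the following is a minimal sketch, not the SCAR implementation: it trains a toy least-squares model with gradient descent, simulates a partial failure in which one "worker's" parameter partition is lost, and counts how many extra iterations each recovery strategy needs before the objective returns to its pre-failure value. The objective, the two-way parameter split, the `iteration_cost` helper, and the partial-recovery strategy shown here are illustrative assumptions chosen for this example.

```python
# Sketch only: measures the "iteration cost" of a parameter perturbation,
# i.e., how many extra iterations training needs to regain its pre-failure
# objective value. The problem setup and recovery strategies are assumptions
# for illustration, not the paper's system.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))     # toy data matrix
b = rng.standard_normal(200)           # toy targets
lr = 0.02                              # small enough for monotone gradient descent

def loss(w):
    return 0.5 * np.mean((A @ w - b) ** 2)

def gd_step(w):
    return w - lr * A.T @ (A @ w - b) / len(b)

def iteration_cost(w, target, max_iters=50_000):
    """Iterations needed for the objective to fall back to `target`."""
    for t in range(max_iters):
        if loss(w) <= target:
            return t
        w = gd_step(w)
    return max_iters

# Train, keeping an earlier checkpoint of the parameters.
w = np.zeros(50)
for t in range(500):
    if t == 100:
        checkpoint = w.copy()
    w = gd_step(w)
target = loss(w)          # objective value reached just before the "failure"
lost = slice(25, 50)      # simulated partial failure: one partition is lost

# Strategy 1: traditional checkpoint recovery -- roll *all* parameters back.
w_full_rollback = checkpoint.copy()

# Strategy 2: partial recovery -- keep the surviving parameters at their
# current values and restore only the lost partition from the checkpoint,
# which typically leaves the model closer to its state at failure time.
w_partial = w.copy()
w_partial[lost] = checkpoint[lost]

print("iteration cost, full rollback:   ", iteration_cost(w_full_rollback, target))
print("iteration cost, partial recovery:", iteration_cost(w_partial, target))
```

In this toy setting, restoring only the lost partition usually perturbs the model less than a full rollback and therefore incurs a lower iteration cost, which is the effect that SCAR's checkpoint-recovery strategies aim to exploit.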
