Speeding up distributed machine learning using codes

Distributed machine learning algorithms that are widely run on modern large-scale computing platforms face several types of randomness, uncertainty, and system "noise," including stragglers (nodes that run significantly slower than average), system failures, maintenance outages, and communication bottlenecks. In this work, we view distributed machine learning algorithms through a coding-theoretic lens and show how codes can equip them with robustness against this system noise. Motivated by their importance and universality, we focus on two of the most basic building blocks of distributed learning algorithms: data shuffling and matrix multiplication. In data shuffling, we use codes to reduce communication bottlenecks: when a constant fraction of the data can be cached at each worker node and n is the number of workers, coded shuffling reduces the communication cost by up to a factor of Θ(n) over uncoded shuffling. For matrix multiplication, we use codes to alleviate the effect of stragglers, commonly known as the straggler problem. We show that if the number of workers is n and the runtime of each subtask has an exponential tail, optimal coded matrix multiplication is Θ(log n) times faster than uncoded matrix multiplication or the optimal task-replication scheme.
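
As a toy illustration of the coded-shuffling idea, the following sketch shows a minimal two-worker case (block names, sizes, and the setup are illustrative assumptions, not the paper's general scheme): when each worker already caches the block the other one needs, one multicast of an XOR-coded packet replaces two separate unicasts. With n workers and constant-fraction caches, this kind of multicast gain is what drives the up-to-Θ(n) saving.

```python
import numpy as np

# Minimal coded-shuffling sketch with two workers (illustrative only):
# worker 1 needs block_A and caches block_B; worker 2 needs block_B and
# caches block_A. A single XOR-coded multicast serves both requests.
rng = np.random.default_rng(0)
block_A = rng.integers(0, 256, size=64, dtype=np.uint8)  # needed by worker 1, cached at worker 2
block_B = rng.integers(0, 256, size=64, dtype=np.uint8)  # needed by worker 2, cached at worker 1

coded_packet = block_A ^ block_B  # one multicast packet from the master

# Each worker decodes its missing block using its cache.
assert np.array_equal(coded_packet ^ block_B, block_A)   # worker 1 recovers block_A
assert np.array_equal(coded_packet ^ block_A, block_B)   # worker 2 recovers block_B
```

Uncoded shuffling would transmit both blocks separately; here a single coded transmission suffices, and the gap grows as more workers can share each coded multicast.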

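For the straggler problem, the sketch below illustrates (n, k) MDS-coded matrix-vector multiplication: the data matrix is split row-wise into k blocks, encoded into n coded blocks, and the product A @ x is decodable from the k fastest workers, so up to n − k stragglers can be ignored. The random Gaussian generator matrix (which acts as an MDS code over the reals, since every k × k submatrix is invertible with probability 1), the block sizes, and the function names are assumptions of this sketch rather than the paper's exact construction.

```python
import numpy as np

def encode(A, n, k, rng):
    """Split A row-wise into k blocks and form n coded blocks via G @ blocks."""
    blocks = np.stack(np.split(A, k))          # shape (k, m//k, d); assumes k divides m
    G = rng.standard_normal((n, k))            # random generator matrix (MDS w.p. 1)
    coded = np.tensordot(G, blocks, axes=1)    # shape (n, m//k, d)
    return G, coded

def worker(coded_block, x):
    """Each worker multiplies its coded block by x (possibly slowly)."""
    return coded_block @ x                     # shape (m//k,)

def decode(G, finished):
    """Recover A @ x from any k (worker_id, partial_result) pairs."""
    ids = [i for i, _ in finished]
    Y = np.stack([y for _, y in finished])     # shape (k, m//k)
    B = np.linalg.solve(G[ids], Y)             # uncoded block products, shape (k, m//k)
    return B.reshape(-1)                       # concatenate blocks into A @ x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, d, n, k = 12, 5, 6, 4                   # 6 workers; any 4 results suffice
    A, x = rng.standard_normal((m, d)), rng.standard_normal(d)
    G, coded = encode(A, n, k, rng)
    # Pretend workers 1 and 4 are stragglers: decode from the other four.
    finished = [(i, worker(coded[i], x)) for i in (0, 2, 3, 5)]
    assert np.allclose(decode(G, finished), A @ x)
```

Because the master only waits for the fastest k of n workers, the completion time is an order statistic of the k-th fastest rather than the slowest worker, which is the source of the Θ(log n) speedup under exponential-tail runtimes.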