Efficient and Programmable Distributed Shared Memory Systems for Machine Learning Training
[1] Joseph K. Bradley,et al. Yggdrasil: An Optimized System for Training Deep Decision Trees at Scale , 2016, NIPS.
[2] Peter Norvig,et al. Deep Learning with Dynamic Computation Graphs , 2017, ICLR.
[3] Alexander J. Smola,et al. Scaling Distributed Machine Learning with the Parameter Server , 2014, OSDI.
[4] Alan Edelman,et al. Julia: A Fresh Approach to Numerical Computing , 2014, SIAM Rev.
[5] Garth A. Gibson,et al. PRObE: A Thousand-Node Experimental Cluster for Computer Systems Research , 2013, ;login: The USENIX Magazine.
[6] Seunghak Lee,et al. More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server , 2013, NIPS.
[7] Eric P. Xing,et al. Exploiting iterative-ness for parallel ML computations , 2014, SoCC.
[8] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[9] Tianqi Chen,et al. XGBoost: A Scalable Tree Boosting System , 2016, KDD.
[10] Weimin Zheng,et al. Exploring the Hidden Dimension in Graph Processing , 2016, OSDI.
[11] Seunghak Lee,et al. STRADS: a distributed framework for scheduled model parallel machine learning , 2016, EuroSys.
[12] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2016, CVPR.
[13] Aart J. C. Bik,et al. Pregel: a system for large-scale graph processing , 2010, SIGMOD Conference.
[14] Michael I. Jordan,et al. SparkNet: Training Deep Networks in Spark , 2015, ICLR.
[15] L. Dagum,et al. OpenMP: an industry standard API for shared-memory programming , 1998, IEEE Computational Science and Engineering.
[16] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.
[17] Shou-De Lin,et al. Feature Engineering and Classifier Ensemble for KDD Cup 2010 , 2010, KDD Cup.
[18] Stephen J. Wright,et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.
[19] Wei Li,et al. Tux2: Distributed Graph Computation for Machine Learning , 2017, NSDI.
[20] Joseph M. Hellerstein,et al. GraphLab: A New Framework For Parallel Machine Learning , 2010, UAI.
[21] John K. Ousterhout,et al. Scripting: Higher-Level Programming for the 21st Century , 1998, Computer.
[22] Yuan Yu,et al. TensorFlow: A system for large-scale machine learning , 2016, OSDI.
[23] Matthew J. Streeter,et al. Delay-Tolerant Algorithms for Asynchronous Distributed Online Learning , 2014, NIPS.
[24] Tie-Yan Liu,et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree , 2017, NIPS.
[25] Kirk L. Johnson,et al. CRL: high-performance all-software distributed shared memory , 1995, SOSP.
[26] Geoffrey E. Hinton,et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer , 2017, ICLR.
[27] Monica S. Lam,et al. A Loop Transformation Theory and an Algorithm to Maximize Parallelism , 1991, IEEE Trans. Parallel Distributed Syst..
[28] Yoshua Bengio,et al. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation , 2013, ArXiv.
[29] Zheng Zhang,et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems , 2015, ArXiv.
[30] Samy Bengio,et al. Tensor2Tensor for Neural Machine Translation , 2018, AMTA.
[31] Scott Shenker,et al. Effective Straggler Mitigation: Attack of the Clones , 2013, NSDI.
[32] Eric P. Xing,et al. GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server , 2016, EuroSys.
[33] Carlos Guestrin,et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud , 2012, Proc. VLDB Endow.
[34] Steven Hand,et al. CIEL: A Universal Execution Engine for Distributed Data-Flow Computing , 2011, NSDI.
[35] Michael J. Franklin,et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.
[36] Elad Hoffer,et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks , 2017, NIPS.
[37] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[38] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.
[39] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[40] Alexander J. Smola,et al. Scalable inference in latent variable models , 2012, WSDM '12.
[41] Chong Wang,et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin , 2016, ICML.
[42] Albert G. Greenberg,et al. Reining in the Outliers in Map-Reduce Clusters using Mantri , 2010, OSDI.
[43] M. Abadi,et al. Naiad: a timely dataflow system , 2013, SOSP.
[44] Gu-Yeon Wei,et al. HELIX: automatic parallelization of irregular programs for chip multiprocessing , 2012, CGO '12.
[45] Ken Kennedy,et al. Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .
[46] Martín Abadi,et al. Dynamic control flow in large-scale machine learning , 2018, EuroSys.
[47] Eric P. Xing,et al. Managed communication and consistency for fast data-parallel iterative analytics , 2015, SoCC.
[48] Mohammed J. Zaki,et al. Arabesque: a system for distributed graph mining , 2015, SOSP.
[49] Michael Isard,et al. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language , 2008, OSDI.
[50] Seunghak Lee,et al. Exploiting Bounded Staleness to Speed Up Big Data Analytics , 2014, USENIX Annual Technical Conference.
[51] Pritish Narayanan,et al. Deep Learning with Limited Numerical Precision , 2015, ICML.
[52] Binyu Zang,et al. PowerLyra: Differentiated Graph Computation and Partitioning on Skewed Graphs , 2019, TOPC.
[53] David A. Patterson,et al. A New Golden Age in Computer Architecture: Empowering the Machine-Learning Revolution , 2018, IEEE Micro.
[54] Medhat A. Moussa,et al. Resource Efficient Arithmetic Effects on RBM Neural Network Solution Quality Using MNIST , 2011, International Conference on Reconfigurable Computing and FPGAs.
[55] Eric P. Xing,et al. High-Performance Distributed ML at Scale through Parameter Server Consistency Models , 2014, AAAI.
[56] Monica S. Lam,et al. Maximizing parallelism and minimizing synchronization with affine transforms , 1997, POPL '97.
[57] Paul Hudak,et al. Memory coherence in shared virtual memory systems , 1986, PODC '86.
[58] Wenguang Chen,et al. Gemini: A Computation-Centric Distributed Graph Processing System , 2016, OSDI.
[59] Michael I. Jordan,et al. Estimation, Optimization, and Parallelism when Data is Sparse , 2013, NIPS.
[60] Itamar Arel,et al. Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks , 2013, ICLR.
[61] Reynold Xin,et al. GraphX: Graph Processing in a Distributed Dataflow Framework , 2014, OSDI.
[62] Yoshua Bengio,et al. Low precision arithmetic for deep learning , 2014, ICLR.
[63] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.
[64] Monica S. Lam,et al. Efficient and exact data dependence analysis , 1991, PLDI '91.
[65] Alan Edelman,et al. On Machine Learning and Programming Languages , 2018, SysML.
[66] Joseph Gonzalez,et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs , 2012, OSDI.
[67] Yuan Yu,et al. Dryad: distributed data-parallel programs from sequential building blocks , 2007, EuroSys '07.
[68] Jinyang Li,et al. Piccolo: Building fast, distributed programs with partitioned tables , 2010, OSDI.
[69] Veljko M. Milutinovic,et al. Distributed shared memory: concepts and systems , 1997, IEEE Parallel Distributed Technol. Syst. Appl.
[70] Seunghak Lee,et al. Solving the Straggler Problem with Bounded Staleness , 2013, HotOS.
[71] Joelle Pineau,et al. Conditional Computation in Neural Networks for faster models , 2015, ArXiv.