Dynamic parameter allocation in parameter servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in distributed training, but can induce severe communication overhead. To reduce this overhead, distributed machine learning algorithms use techniques that increase parameter access locality (PAL), achieving up to linear speed-ups. We found, however, that existing parameter servers provide only limited support for PAL techniques and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
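As a rough illustration of the core idea, the sketch below simulates a parameter server with dynamic parameter allocation in a single Python process. This is not the authors' implementation, and all class and method names are illustrative. The point it shows: in addition to the classic pull and push primitives, a localize primitive relocates parameters to the requesting node, so that a worker's subsequent accesses to those parameters are served locally instead of over the network.

```python
# Minimal single-process sketch of dynamic parameter allocation
# (illustrative only; a real parameter server distributes the stores
# and ownership metadata across machines and uses network messages).

class SimulatedCluster:
    def __init__(self, num_nodes, keys):
        # Start from a static partitioning of keys across nodes.
        self.stores = {n: {} for n in range(num_nodes)}
        self.owner = {}  # key -> node that currently holds the parameter
        for i, k in enumerate(keys):
            n = i % num_nodes
            self.stores[n][k] = 0.0
            self.owner[k] = n

class PSClient:
    def __init__(self, cluster, my_node):
        self.cluster = cluster
        self.my_node = my_node

    def pull(self, keys):
        """Read parameter values; a local read if the key is owned here."""
        return [self.cluster.stores[self.cluster.owner[k]][k] for k in keys]

    def push(self, keys, updates):
        """Apply additive updates, wherever the parameters reside."""
        for k, u in zip(keys, updates):
            self.cluster.stores[self.cluster.owner[k]][k] += u

    def localize(self, keys):
        """Dynamic allocation: relocate parameters to this node so that
        subsequent pulls and pushes are local (no communication)."""
        for k in keys:
            old = self.cluster.owner[k]
            if old != self.my_node:
                self.cluster.stores[self.my_node][k] = \
                    self.cluster.stores[old].pop(k)
                self.cluster.owner[k] = self.my_node

# Usage: a worker localizes the keys it is about to access repeatedly.
cluster = SimulatedCluster(num_nodes=2, keys=["w0", "w1", "w2", "w3"])
worker = PSClient(cluster, my_node=0)
worker.localize(["w1", "w3"])           # relocate before a burst of accesses
worker.push(["w1", "w3"], [0.5, -0.2])  # now served from the local store
print(worker.pull(["w1", "w3"]))        # -> [0.5, -0.2]
```

In a real system, relocation is a network protocol that must stay correct under concurrent pulls and pushes from other workers, and localize calls would typically be issued ahead of a processing phase so that relocation latency can be hidden behind computation.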
