Dynamic parameter allocation in parameter servers

To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management, a key concern in distributed training, but can induce severe communication overhead. To reduce this overhead, distributed machine learning algorithms use techniques that increase parameter access locality (PAL), achieving up to linear speed-ups. We found, however, that existing parameter servers provide only limited support for PAL techniques and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers.
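As a rough illustration of the core idea, the sketch below simulates a parameter server with dynamic parameter allocation in a single Python process. This is not the authors' implementation, and all class and method names are illustrative. The point it shows: in addition to the classic pull and push primitives, a localize primitive relocates parameters to the requesting node, so that a worker's subsequent accesses to those parameters are served locally instead of over the network.

```python
# Minimal single-process sketch of dynamic parameter allocation
# (illustrative only; a real parameter server distributes the stores
# and ownership metadata across machines and uses network messages).

class SimulatedCluster:
    def __init__(self, num_nodes, keys):
        # Start from a static partitioning of keys across nodes.
        self.stores = {n: {} for n in range(num_nodes)}
        self.owner = {}  # key -> node that currently holds the parameter
        for i, k in enumerate(keys):
            n = i % num_nodes
            self.stores[n][k] = 0.0
            self.owner[k] = n

class PSClient:
    def __init__(self, cluster, my_node):
        self.cluster = cluster
        self.my_node = my_node

    def pull(self, keys):
        """Read parameter values; a local read if the key is owned here."""
        return [self.cluster.stores[self.cluster.owner[k]][k] for k in keys]

    def push(self, keys, updates):
        """Apply additive updates, wherever the parameters reside."""
        for k, u in zip(keys, updates):
            self.cluster.stores[self.cluster.owner[k]][k] += u

    def localize(self, keys):
        """Dynamic allocation: relocate parameters to this node so that
        subsequent pulls and pushes are local (no communication)."""
        for k in keys:
            old = self.cluster.owner[k]
            if old != self.my_node:
                self.cluster.stores[self.my_node][k] = \
                    self.cluster.stores[old].pop(k)
                self.cluster.owner[k] = self.my_node

# Usage: a worker localizes the keys it is about to access repeatedly.
cluster = SimulatedCluster(num_nodes=2, keys=["w0", "w1", "w2", "w3"])
worker = PSClient(cluster, my_node=0)
worker.localize(["w1", "w3"])           # relocate before a burst of accesses
worker.push(["w1", "w3"], [0.5, -0.2])  # now served from the local store
print(worker.pull(["w1", "w3"]))        # -> [0.5, -0.2]
```

In a real system, relocation is a network protocol that must stay correct under concurrent pulls and pushes from other workers, and localize calls would typically be issued ahead of a processing phase so that relocation latency can be hidden behind computation.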
