Efficient sparse collective communication and its application to accelerate distributed deep learning

[1] Samuel Williams, et al. Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression, 2018, IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).

[2] Minsik Cho, et al. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy, 2019, IBM J. Res. Dev.

[3] Tian Zhou, et al. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, 2020, WSDM.

[4] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[5] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.

[7] J. Kiefer, et al. Stochastic Estimation of the Maximum of a Regression Function, 1952, The Annals of Mathematical Statistics.

[8] John F. Canny, et al. Kylix: A Sparse Allreduce for Commodity Clusters, 2014, 43rd International Conference on Parallel Processing.

[9] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.

[10] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.

[11] Panos Kalnis, et al. Scaling Distributed Machine Learning with In-Network Aggregation, 2019, NSDI.

[12] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.

[13] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.

[14] Yibo Zhu, et al. A generic communication scheduler for distributed DNN training acceleration, 2019, SOSP.

[15] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[16] Aritra Dutta, et al. GRACE: A Compressed Communication Framework for Distributed Machine Learning, 2021, IEEE 41st International Conference on Distributed Computing Systems (ICDCS).

[17] Sangeetha Abdu Jyothi, et al. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling, 2018, MLSys.

[18] Janis Keuper, et al. Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability, 2016, 2nd Workshop on Machine Learning in HPC Environments (MLHPC).

[19] Michael M. Swift, et al. ATP: In-network Aggregation for Multi-tenant Learning, 2021, NSDI.

[20] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.

[21] Mohammad Alian, et al. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[23] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[24] Suhas Diggavi, et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations, 2019, IEEE Journal on Selected Areas in Information Theory.

[25] George Varghese, et al. P4: programming protocol-independent packet processors, 2013, Comput. Commun. Rev.

[26] Joseph A. Konstan, et al. The MovieLens Datasets, 2015, TIIS.

[27] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.

[28] James T. Kwok, et al. Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback, 2019, NeurIPS.

[29] Nikhil R. Devanur, et al. Blink: Fast and Generic Collectives for Distributed ML, 2019, MLSys.

[30] Byung-Gon Chun, et al. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks, 2018, EuroSys.

[31] Nan Jiang, et al. An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives, 2020, ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).

[32] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.

[33] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.

[34] Yonghui Wu, et al. Exploring the Limits of Language Modeling, 2016, arXiv.

[35] Martin Jaggi, et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, 2019, ICML.

[36] Alexander G. Schwing, et al. Accelerating Distributed Reinforcement Learning with In-Switch Computing, 2019, ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[37] Amar Phanishayee, et al. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training, 2018, SoCC.

[38] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.

[39] Gennady Pekhimenko, et al. Priority-based Parameter Propagation for Distributed DNN Training, 2019, SysML.

[40] Tao Lin, et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.

[41] Ahmed M. Abdelmoniem, et al. On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning, 2019, AAAI.

[42] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2019, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.

[43] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.

[44] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[45] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.

[46] Aritra Dutta, et al. DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning, 2021, arXiv.

[47] F. Maxwell Harper, et al. The MovieLens Datasets: History and Context, 2016, TIIS.

[48] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, arXiv.

[49] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.

[50] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2018, SC.

[51] Panos Kalnis, et al. In-Network Computation is a Dumb Idea Whose Time Has Come, 2017, HotNets.

[52] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.

[53] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[54] Sam Ade Jacobs, et al. Communication Quantization for Data-Parallel Training of Deep Neural Networks, 2016, 2nd Workshop on Machine Learning in HPC Environments (MLHPC).

[55] Jesper Larsson Träff. Transparent Neutral Element Elimination in MPI Reduction Operations, 2010, EuroMPI.

[56] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.

[57] Gudula Rünger, et al. MPI Reduction Operations for Sparse Floating-point Data, 2008, PVM/MPI.

[58] Tat-Seng Chua, et al. Neural Collaborative Filtering, 2017, WWW.

[59] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[60] Paolo Costa, et al. In-network Aggregation for Shared Machine Learning Clusters, 2021, MLSys.

[61] Zhen Zhang, et al. Is Network the Bottleneck of Distributed Training?, 2020, NetAI@SIGCOMM.

[62] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.

[63] Torsten Hoefler, et al. Flare: flexible in-network allreduce, 2021, SC.

[64] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.

[65] Valentin Petrov, et al. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation, 2020, ISC.

[66] Percy Liang, et al. Know What You Don’t Know: Unanswerable Questions for SQuAD, 2018, ACL.

[67] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.