Efficient sparse collective communication and its application to accelerate distributed deep learning

[1] Samuel Williams, et al. Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression, 2018, IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).

[2] Minsik Cho, et al. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy, 2019, IBM J. Res. Dev.

[3] Tian Zhou, et al. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, 2020, WSDM.

[4] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.

[5] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.

[7] J. Kiefer, et al. Stochastic Estimation of the Maximum of a Regression Function, 1952, The Annals of Mathematical Statistics.

[8] John F. Canny, et al. Kylix: A Sparse Allreduce for Commodity Clusters, 2014, 43rd International Conference on Parallel Processing.

[9] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.

[10] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.

[11] Panos Kalnis, et al. Scaling Distributed Machine Learning with In-Network Aggregation, 2019, NSDI.

[12] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.

[13] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.

[14] Yibo Zhu, et al. A generic communication scheduler for distributed DNN training acceleration, 2019, SOSP.

[15] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, arXiv:1610.02132.

[16] Aritra Dutta, et al. GRACE: A Compressed Communication Framework for Distributed Machine Learning, 2021, IEEE 41st International Conference on Distributed Computing Systems (ICDCS).

[17] Sangeetha Abdu Jyothi, et al. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling, 2018, MLSys.

[18] Janis Keuper, et al. Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability, 2016, 2nd Workshop on Machine Learning in HPC Environments (MLHPC).

[19] Michael M. Swift, et al. ATP: In-network Aggregation for Multi-tenant Learning, 2021, NSDI.

[20] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.

[21] Mohammad Alian, et al. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[22] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.

[23] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[24] Suhas Diggavi, et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations, 2019, IEEE Journal on Selected Areas in Information Theory.

[25] George Varghese, et al. P4: programming protocol-independent packet processors, 2013, Comput. Commun. Rev.

[26] Joseph A. Konstan, et al. The MovieLens Datasets, 2015, TIIS.

[27] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.

[28] James T. Kwok, et al. Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback, 2019, NeurIPS.

[29] Nikhil R. Devanur, et al. Blink: Fast and Generic Collectives for Distributed ML, 2019, MLSys.

[30] Byung-Gon Chun, et al. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks, 2018, EuroSys.

[31] Nan Jiang, et al. An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives, 2020, ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).

[32] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.

[33] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.

[34] Yonghui Wu, et al. Exploring the Limits of Language Modeling, 2016, arXiv.

[35] Martin Jaggi, et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, 2019, ICML.

[36] Alexander G. Schwing, et al. Accelerating Distributed Reinforcement Learning with In-Switch Computing, 2019, ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).

[37] Amar Phanishayee, et al. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training, 2018, SoCC.

[38] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, 1994.

[39] Gennady Pekhimenko, et al. Priority-based Parameter Propagation for Distributed DNN Training, 2019, SysML.

[40] Tao Lin, et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.

[41] Ahmed M. Abdelmoniem, et al. On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning, 2019, AAAI.

[42] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2019, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis.

[43] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.

[44] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.

[45] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.

[46] Aritra Dutta, et al. DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning, 2021, arXiv.

[47] F. Maxwell Harper, et al. The MovieLens Datasets: History and Context, 2016, TIIS.

[48] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, arXiv.

[49] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.

[50] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2018, SC.

[51] Panos Kalnis, et al. In-Network Computation is a Dumb Idea Whose Time Has Come, 2017, HotNets.

[52] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.

[53] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[54] Sam Ade Jacobs, et al. Communication Quantization for Data-Parallel Training of Deep Neural Networks, 2016, 2nd Workshop on Machine Learning in HPC Environments (MLHPC).

[55] Jesper Larsson Träff. Transparent Neutral Element Elimination in MPI Reduction Operations, 2010, EuroMPI.

[56] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.

[57] Gudula Rünger, et al. MPI Reduction Operations for Sparse Floating-point Data, 2008, PVM/MPI.

[58] Tat-Seng Chua, et al. Neural Collaborative Filtering, 2017, WWW.

[59] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[60] Paolo Costa, et al. In-network Aggregation for Shared Machine Learning Clusters, 2021, MLSys.

[61] Zhen Zhang, et al. Is Network the Bottleneck of Distributed Training?, 2020, NetAI@SIGCOMM.

[62] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.

[63] Torsten Hoefler, et al. Flare: flexible in-network allreduce, 2021, SC.

[64] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.

[65] Valentin Petrov, et al. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation, 2020, ISC.

[66] Percy Liang, et al. Know What You Don’t Know: Unanswerable Questions for SQuAD, 2018, ACL.

[67] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.