Efficient sparse collective communication and its application to accelerate distributed deep learning
Marco Canini | Amedeo Sapio | Jiawei Fei | Atal Narayan Sahu | Chen-Yu Ho
[1] Samuel Williams, et al. Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression, 2018, 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS).
[2] Minsik Cho, et al. BlueConnect: Decomposing all-reduce for deep learning on heterogeneous network hierarchy, 2019, IBM J. Res. Dev.
[3] Tian Zhou, et al. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, 2020, WSDM.
[4] Michael S. Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[5] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Nikko Strom, et al. Scalable distributed DNN training using commodity GPU cloud computing, 2015, INTERSPEECH.
[7] J. Kiefer, et al. Stochastic Estimation of the Maximum of a Regression Function, 1952.
[8] John F. Canny, et al. Kylix: A Sparse Allreduce for Commodity Clusters, 2014, 2014 43rd International Conference on Parallel Processing.
[9] Rajeev Thakur, et al. Optimization of Collective Communication Operations in MPICH, 2005, Int. J. High Perform. Comput. Appl.
[10] Nikhil R. Devanur, et al. PipeDream: generalized pipeline parallelism for DNN training, 2019, SOSP.
[11] Panos Kalnis, et al. Scaling Distributed Machine Learning with In-Network Aggregation, 2019, NSDI.
[12] Xin Yuan, et al. Bandwidth optimal all-reduce algorithms for clusters of workstations, 2009, J. Parallel Distributed Comput.
[13] William J. Dally, et al. Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training, 2017, ICLR.
[14] Yibo Zhu, et al. A generic communication scheduler for distributed DNN training acceleration, 2019, SOSP.
[15] Dan Alistarh, et al. QSGD: Communication-Optimal Stochastic Gradient Descent, with Applications to Training Neural Networks, 2016, ArXiv abs/1610.02132.
[16] Aritra Dutta, et al. GRACE: A Compressed Communication Framework for Distributed Machine Learning, 2021, 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS).
[17] Sangeetha Abdu Jyothi, et al. TicTac: Accelerating Distributed Deep Learning with Communication Scheduling, 2018, MLSys.
[18] Janis Keuper, et al. Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability, 2016, 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[19] Michael M. Swift, et al. ATP: In-network Aggregation for Multi-tenant Learning, 2021, NSDI.
[20] Cong Xu, et al. TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning, 2017, NIPS.
[21] Mohammad Alian, et al. A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks, 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[22] Luke S. Zettlemoyer, et al. Deep Contextualized Word Representations, 2018, NAACL.
[23] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.
[24] Suhas Diggavi, et al. Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations, 2019, IEEE Journal on Selected Areas in Information Theory.
[25] George Varghese, et al. P4: programming protocol-independent packet processors, 2013, Comput. Commun. Rev.
[26] Joseph A. Konstan, et al. The MovieLens Datasets: History and Context, 2015, TIIS.
[27] Charles R. Qi, et al. Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks, 2018, ICML.
[28] James T. Kwok, et al. Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback, 2019, NeurIPS.
[29] Nikhil R. Devanur, et al. Blink: Fast and Generic Collectives for Distributed ML, 2019, MLSys.
[30] Byung-Gon Chun, et al. Parallax: Sparsity-aware Data Parallel Training of Deep Neural Networks, 2018, EuroSys.
[31] Nan Jiang, et al. An In-Network Architecture for Accelerating Shared-Memory Multiprocessor Collectives, 2020, 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
[32] Martin Jaggi, et al. Error Feedback Fixes SignSGD and other Gradient Compression Schemes, 2019, ICML.
[33] Dan Alistarh, et al. The Convergence of Sparsified Gradient Methods, 2018, NeurIPS.
[34] Yonghui Wu, et al. Exploring the Limits of Language Modeling, 2016, ArXiv.
[35] Martin Jaggi, et al. Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication, 2019, ICML.
[36] Alexander G. Schwing, et al. Accelerating Distributed Reinforcement Learning with In-Switch Computing, 2019, 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA).
[37] Amar Phanishayee, et al. Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training, 2018, SoCC.
[38] Message Passing Interface Forum. MPI: A message-passing interface standard, 1994.
[39] Gennady Pekhimenko, et al. Priority-based Parameter Propagation for Distributed DNN Training, 2019, SysML.
[40] Tao Lin, et al. Don't Use Large Mini-Batches, Use Local SGD, 2018, ICLR.
[41] Ahmed M. Abdelmoniem, et al. On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning, 2019, AAAI.
[42] Cèdric Renggli, et al. SparCML: High-Performance Sparse Communication for Machine Learning, 2019, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[43] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.
[44] Marc'Aurelio Ranzato, et al. Large Scale Distributed Deep Networks, 2012, NIPS.
[45] Alexander Aiken, et al. Beyond Data and Model Parallelism for Deep Neural Networks, 2018, SysML.
[46] Aritra Dutta, et al. DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning, 2021, ArXiv.
[47] F. Maxwell Harper, et al. The MovieLens Datasets: History and Context, 2016, TIIS.
[48] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, ArXiv.
[49] Trishul M. Chilimbi, et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System, 2014, OSDI.
[50] Dan Alistarh, et al. SparCML: high-performance sparse communication for machine learning, 2018, SC.
[51] Panos Kalnis, et al. In-Network Computation is a Dumb Idea Whose Time Has Come, 2017, HotNets.
[52] Kenneth Heafield, et al. Sparse Communication for Distributed Gradient Descent, 2017, EMNLP.
[53] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[54] Sam Ade Jacobs, et al. Communication Quantization for Data-Parallel Training of Deep Neural Networks, 2016, 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[55] Jesper Larsson Träff. Transparent Neutral Element Elimination in MPI Reduction Operations, 2010, EuroMPI.
[56] Yibo Zhu, et al. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters, 2020, OSDI.
[57] Gudula Rünger, et al. MPI Reduction Operations for Sparse Floating-point Data, 2008, PVM/MPI.
[58] Tat-Seng Chua, et al. Neural Collaborative Filtering, 2017, WWW.
[59] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[60] Paolo Costa, et al. In-network Aggregation for Shared Machine Learning Clusters, 2021, MLSys.
[61] Zhen Zhang, et al. Is Network the Bottleneck of Distributed Training?, 2020, NetAI@SIGCOMM.
[62] Martin Jaggi, et al. Sparsified SGD with Memory, 2018, NeurIPS.
[63] Torsten Hoefler, et al. Flare: flexible in-network allreduce, 2021, SC.
[64] Nenghai Yu, et al. Asynchronous Stochastic Gradient Descent with Delay Compensation, 2016, ICML.
[65] Valentin Petrov, et al. Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation, 2020, ISC.
[66] Percy Liang, et al. Know What You Don’t Know: Unanswerable Questions for SQuAD, 2018, ACL.
[67] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.