论文信息 - Efficient Distributed Algorithms for Convolutional Neural Networks

Efficient Distributed Algorithms for Convolutional Neural Networks

Several efficient distributed algorithms have been developed for matrix-matrix multiplication: the 3D algorithm, the 2D SUMMA algorithm, and the 2.5D algorithm. Each of these algorithms was independently conceived and they trade-off memory needed per node and the inter-node data communication volume. The convolutional neural network (CNN) computation may be viewed as a generalization of matrix-multiplication combined with neighborhood stencil computations. We develop communication-efficient distributed-memory algorithms for CNNs that are analogous to the 2D/2.5D/3D algorithms for matrix-matrix multiplication.

Aravind Sukumaran-Rajam | Atanas Rountev | P Sadayappan | Rui Li | Yufan Xu

[1] Richard Veras,et al. Analytical cache modeling and tilesize optimization for tensor contractions , 2019, SC.

[2] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[3] Alexander Aiken,et al. Beyond Data and Model Parallelism for Deep Neural Networks , 2018, SysML.

[4] S. Lennart Johnsson,et al. Minimizing the Communication Time for Matrix Multiplication on Multiprocessors , 1993, Parallel Comput..

[5] Alexander Sergeev,et al. Horovod: fast and easy distributed deep learning in TensorFlow , 2018, ArXiv.

[6] Vin Sharma,et al. Tuna: A Static Analysis Approach to Optimizing Deep Neural Networks , 2021, ArXiv.

[7] John Tran,et al. cuDNN: Efficient Primitives for Deep Learning , 2014, ArXiv.

[8] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[9] Rui Li,et al. Analytical characterization and design space exploration for optimization of CNNs , 2021, ASPLOS.

[10] Haichen Shen,et al. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning , 2018, OSDI.

[11] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[12] Cody Hao Yu,et al. Ansor : Generating High-Performance Tensor Programs for Deep Learning , 2020, OSDI.

[13] Yida Wang,et al. Optimizing CNN Model Inference on CPUs , 2018, USENIX Annual Technical Conference.

[14] Sartaj Sahni,et al. Parallel Matrix and Graph Algorithms , 1981, SIAM J. Comput..

[15] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.