Accelerator-centric deep learning systems for enhanced scalability, energy-efficiency, and programmability

Deep learning (DL) has been successfully deployed in application domains ranging from computer vision and speech recognition to natural language processing. As network models and the datasets used to train them scale, system architects face new challenges in designing scalable and energy-efficient high-performance computing (HPC) systems for training DL algorithms. One of the key obstacles DL researchers face is the memory capacity bottleneck: the limited physical memory of the PCIe-attached DL accelerator (whether a discrete GPU or an ASIC accelerator such as Google's Tensor Processing Unit) constrains the algorithms that can be studied. In this paper, and the associated invited special session talk, we first discuss recent research literature geared toward designing scalable HPC systems for DL. In this context, we then discuss the memory capacity wall problem and introduce the work on virtualized deep neural networks (vDNN), a memory virtualization solution that systematically reduces the memory consumption of DNN training. We conclude by projecting the future challenges DNN memory virtualization will encounter and by suggesting accelerator-centric DL systems as a promising research direction for a scalable and energy-efficient deep learning system architecture.
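One way such memory virtualization can relieve the capacity bottleneck is by offloading a layer's intermediate feature maps from scarce accelerator memory to the much larger host memory during the forward pass, and prefetching them back over PCIe shortly before the backward pass needs them, with the transfers overlapped against compute. The CUDA sketch below illustrates this offload/prefetch pattern in isolation, assuming a single hypothetical 64 MB activation buffer and a dedicated transfer stream; it is a simplified illustration of the general technique, not the actual vDNN implementation.

```c
/* Minimal sketch of an offload/prefetch pattern for feature-map
 * virtualization (hypothetical buffer size, single tensor; not the
 * actual vDNN code). Compile with nvcc. */
#include <cuda_runtime.h>
#include <stdio.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main(void) {
    const size_t act_bytes = (size_t)64 << 20;  /* hypothetical 64 MB of activations */
    float *d_act = NULL;   /* device-resident activations of one layer     */
    float *h_act = NULL;   /* pinned host buffer that receives the offload */
    cudaStream_t xfer;     /* dedicated transfer stream so copies can
                              overlap with compute on other streams        */

    CHECK(cudaStreamCreate(&xfer));
    CHECK(cudaMalloc((void **)&d_act, act_bytes));
    CHECK(cudaMallocHost((void **)&h_act, act_bytes));

    /* Forward pass: once this layer has produced d_act, start copying it
     * to host memory asynchronously; later layers keep computing while
     * the PCIe transfer is in flight. */
    CHECK(cudaMemcpyAsync(h_act, d_act, act_bytes,
                          cudaMemcpyDeviceToHost, xfer));

    /* After the copy completes, d_act's device memory could be released
     * and reused as workspace for other layers, which is where the
     * capacity savings come from. */

    /* Backward pass: prefetch the activations back to the accelerator
     * shortly before the corresponding gradient computation needs them. */
    CHECK(cudaMemcpyAsync(d_act, h_act, act_bytes,
                          cudaMemcpyHostToDevice, xfer));
    CHECK(cudaStreamSynchronize(xfer));  /* ensure the prefetch finished */

    CHECK(cudaFree(d_act));
    CHECK(cudaFreeHost(h_act));
    CHECK(cudaStreamDestroy(xfer));
    return 0;
}
```

The design choice that makes this practical is the use of pinned (page-locked) host memory and a separate stream: pinned buffers allow true asynchronous DMA over PCIe, so the offload and prefetch copies can hide behind the computation of neighboring layers instead of stalling training.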
