Hotness- and Lifetime-Aware Data Placement and Migration for High-Performance Deep Learning on Heterogeneous Memory Systems

Heterogeneous memory systems that comprise memory nodes with disparate architectural characteristics (e.g., DRAM and high-bandwidth memory (HBM)) have emerged as a promising solution across computing domains ranging from embedded to high-performance computing. Since deep learning (DL) is one of the most widely used workloads in these domains, it is crucial to explore efficient memory management techniques for DL applications that execute on heterogeneous memory systems. Despite extensive prior work on system-software and architectural support for efficient DL, heterogeneity-aware memory management for high-performance DL on heterogeneous memory systems remains largely unexplored. To bridge this gap, we analyze the characteristics of representative DL workloads on a real heterogeneous memory system. Guided by the characterization results, we propose HALO, a hotness- and lifetime-aware data placement and migration technique for high-performance DL on heterogeneous memory systems. Through quantitative evaluation, we demonstrate the effectiveness of HALO: it significantly outperforms various memory management policies supported by the underlying system software and hardware (e.g., 28.2 percent higher performance than the HBM-Preferred policy), achieves performance comparable to the ideal case with infinite HBM, incurs small performance overheads, and delivers high performance across a wide range of application working-set sizes.
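To make the placement idea concrete, the sketch below illustrates one plausible realization of hotness- and lifetime-aware placement, not HALO's actual algorithm: tensors are ranked by estimated accesses per byte (hotness) and greedily assigned to HBM only if HBM capacity is available throughout their lifetimes, with everything else falling back to DRAM. The Tensor fields, the place_tensors function, and the capacity values are illustrative assumptions.

    # Minimal illustrative sketch (hypothetical, not the paper's algorithm):
    # greedy hotness- and lifetime-aware placement of tensors into HBM vs. DRAM.
    from dataclasses import dataclass

    @dataclass
    class Tensor:
        name: str
        size: int     # bytes occupied by the tensor
        hotness: int  # estimated accesses over the tensor's lifetime
        start: int    # first step (e.g., layer index) at which the tensor is live
        end: int      # last step at which the tensor is live

    def place_tensors(tensors, hbm_capacity):
        """Place the hottest tensors per byte into HBM, respecting HBM
        capacity at every step of each tensor's lifetime; the rest go to DRAM."""
        horizon = max(t.end for t in tensors) + 1
        hbm_used = [0] * horizon          # HBM bytes in use at each step
        placement = {}
        # Consider hotter-per-byte tensors first.
        for t in sorted(tensors, key=lambda t: t.hotness / t.size, reverse=True):
            fits = all(hbm_used[s] + t.size <= hbm_capacity
                       for s in range(t.start, t.end + 1))
            if fits:
                for s in range(t.start, t.end + 1):
                    hbm_used[s] += t.size
                placement[t.name] = "HBM"
            else:
                placement[t.name] = "DRAM"
        return placement

    if __name__ == "__main__":
        layers = [
            Tensor("conv1.act", size=64, hotness=400, start=0, end=2),
            Tensor("conv2.act", size=128, hotness=100, start=1, end=3),
            Tensor("fc.weights", size=32, hotness=900, start=0, end=3),
        ]
        # With a 96-byte HBM budget, fc.weights and conv1.act land in HBM,
        # while conv2.act spills to DRAM.
        print(place_tensors(layers, hbm_capacity=96))

Because the capacity check is applied per lifetime step, two tensors whose lifetimes never overlap can share the same HBM space, which is the intuition behind treating placement as a lifetime-constrained (temporal-knapsack-like) problem rather than a one-shot bin-packing problem.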
