Understanding and Tackling the Hidden Memory Latency for Edge-based Heterogeneous Platform

With the burgeoning of autonomous driving, edge-deployed integrated CPU/GPU (iGPU) platforms have gained significant attention from both academia and industry. NVIDIA offers a series of Jetson iGPU platforms that perform well in terms of computation capability, power consumption, and mobile form factor. However, these iGPU platforms typically contain very limited physical memory, which can become the bottleneck of autonomous driving and other edge computing applications. Although the Unified Memory (UM) model in GPU programming can reduce the memory footprint, the programming legacy of the UM model initializes data on the CPU side by default, as the conventional copy-and-execute model does, which causes significant latency in application execution. In this paper, we propose an enhanced unified memory management model (eUMM), which delivers a prefetch-enhanced data-initialization method on the GPU side of the iGPU platform. We evaluate eUMM on the representative Jetson TX2 and AGX Xavier platforms and demonstrate that eUMM not only reduces the initialization latency significantly but also lowers the latency of the subsequent kernel computation and of the entire application.
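To make the hidden latency concrete, the following is a minimal CUDA sketch of the idea the abstract describes, not the authors' eUMM implementation: the init_on_gpu kernel, the array size, and the device index are illustrative assumptions. It contrasts the default CPU-side first-touch initialization of a managed allocation with GPU-side initialization backed by cudaMemAdvise/cudaMemPrefetchAsync hints. Note that on Jetson-class iGPUs the CPU and GPU share the same physical DRAM, so the exact page-migration behavior differs from that of discrete GPUs.

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative GPU-side initializer: each thread fills one element,
// so the first touch of the managed pages happens on the GPU.
__global__ void init_on_gpu(float *data, size_t n, float value) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main() {
    const size_t n = 1 << 24;              // ~16M floats (~64 MB), chosen arbitrarily
    const size_t bytes = n * sizeof(float);
    int dev = 0;
    cudaSetDevice(dev);

    float *data;
    cudaMallocManaged(&data, bytes);       // UM allocation visible to CPU and GPU

    // UM programming legacy: a CPU-side loop such as
    //   for (size_t i = 0; i < n; ++i) data[i] = 1.0f;
    // would first-touch every page on the CPU and incur migration
    // (or shared-memory coherence) overhead when a kernel later runs.

    // Instead, hint the driver and initialize directly on the GPU:
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, dev);
    cudaMemPrefetchAsync(data, bytes, dev, 0);  // populate pages GPU-side up front
    init_on_gpu<<<(unsigned)((n + 255) / 256), 256>>>(data, n, 1.0f);
    cudaDeviceSynchronize();

    // Subsequent kernels now find the pages resident on the GPU side,
    // avoiding the hidden first-touch initialization latency.
    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}

The design point of the sketch matches the abstract's argument: the cost being removed is not the computation itself but where the managed pages are first touched, so moving initialization (plus a prefetch hint) to the GPU side shifts that cost off the application's critical path.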
