An Evaluation of the NVIDIA TX1 for Supporting Real-Time Computer-Vision Workloads

Autonomous vehicles are an exemplar for forward-looking safety-critical real-time systems where significant computing capacity must be provided within strict size, weight, and power (SWaP) limits. A promising way forward in meeting these needs is to leverage multicore platforms augmented with graphics processing units (GPUs) as accelerators. Such an approach is being strongly advocated by NVIDIA, whose Jetson TX1 board is currently a leading multicore+GPU solution marketed for autonomous systems. Unfortunately, no study has ever been published that expressly evaluates the effectiveness of the TX1, or any other comparable platform, in hosting safety-critical real-time workloads. In this paper, such a study is presented. Specifically, the TX1 is evaluated via benchmarking efforts, black-box evaluations of GPU behavior, and case-study evaluations involving computer-vision workloads inspired by autonomous-driving use cases.
