Static WCET Analysis of GPUs with Predictable Warp Scheduling

The capability of GPUs to accelerate general-purpose applications that can be parallelized into massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed for CPUs cannot be used directly to estimate the WCET of GPUs. In this work, based on the analysis of the architecture and dynamic behavior of GPUs, we propose a WCET timing model and analyzer based on a predictable GPU warp scheduling policy to enable the WCET estimation on GPUs.

[1]  John E. Stone,et al.  OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.

[2]  Eduardo Tovar,et al.  WCET Measurement-based and Extreme Value Theory Characterisation of CUDA Kernels , 2014, RTNS.

[3]  Jürgen Schmidhuber,et al.  Multi-column deep neural network for traffic sign classification , 2012, Neural Networks.

[4]  James H. Anderson,et al.  GPUSync: Architecture-Aware Management of GPUs for Predictable Multi-GPU Real-Time Systems , 2012 .

[5]  Onur Mutlu,et al.  Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Yun Liang,et al.  An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[7]  Adam Betts,et al.  Estimating the WCET of GPU-Accelerated Applications Using Hybrid Analysis , 2013, 2013 25th Euromicro Conference on Real-Time Systems.

[8]  Tor M. Aamodt,et al.  Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware , 2009, TACO.

[9]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[10]  Wei Zhang,et al.  Warp-Based Load/Store Reordering to Improve GPU Data Cache Time Predictability and Performance , 2016, 2016 IEEE 19th International Symposium on Real-Time Distributed Computing (ISORC).

[11]  Martin Schoeberl,et al.  Time-Predictable Computer Architecture , 2009, EURASIP J. Embed. Syst..

[12]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[13]  Anders Eklund,et al.  Medical image processing on the GPU - Past, present and future , 2013, Medical Image Anal..

[14]  Andrew W. Fitzgibbon,et al.  Real-time human pose recognition in parts from single depth images , 2011, CVPR 2011.

[15]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[16]  James H. Anderson,et al.  Globally scheduled real-time multiprocessor systems with GPUs , 2011, Real-Time Systems.

[17]  Richard W. Vuduc,et al.  A performance analysis framework for identifying potential benefits in GPGPU applications , 2012, PPoPP '12.

[18]  Andrew S. Grimshaw,et al.  Scalable GPU graph traversal , 2012, PPoPP '12.