Warp-Based Load/Store Reordering to Improve GPU Data Cache Time Predictability and Performance

Graphics Processing Units (GPUs) have great potential to improve the performance and energy efficiency of data-parallel real-time applications. However, it is very difficult to compute the worst-case execution time (WCET) for current GPUs, which are designed to improve average-case throughput rather than time predictability. In this paper, we propose a warp-based load/store reordering mechanism that improves the time predictability of GPU data caching without incurring much performance overhead. This mechanism can be used in conjunction with dynamic warp scheduling to achieve better performance than pure round-robin scheduling while enabling accurate static timing analysis to bound the worst-case number of GPU L1 data cache misses.
