WCET analysis of the shared data cache in integrated CPU-GPU architectures

By combining the strengths of the CPU and the GPU with a shared DRAM and cache, the integrated CPU-GPU architecture has the potential to boost performance for a variety of applications, including real-time applications. However, before it can be applied to hard real-time and safety-critical applications, its time-predictability needs to be studied and improved. In this work, we study the shared data Last Level Cache (LLC) in the integrated CPU-GPU architecture and propose an access-interval-based method to improve the time-predictability of the LLC. The results show that the proposed technique effectively improves the accuracy of LLC miss rate estimation. We also find that the improved LLC miss rate estimates can be used to further improve the WCET estimates of GPU kernels running on such an architecture.
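To illustrate why a tighter LLC miss rate estimate can tighten a WCET estimate, the following is a minimal sketch under a simple additive timing model. It is not the access-interval method proposed in this work. The function name `llc_time_bound`, the latency values, and the access count are hypothetical, chosen only for illustration.

```python
import math


def llc_time_bound(num_accesses, miss_rate_bound, hit_latency, miss_penalty):
    """Upper bound (in cycles) on the time spent in LLC accesses.

    Assumes a simple additive model (an illustrative assumption): every
    access pays hit_latency, and each miss pays miss_penalty on top.
    miss_rate_bound must be a safe upper bound on the true miss rate.
    """
    if not 0.0 <= miss_rate_bound <= 1.0:
        raise ValueError("miss_rate_bound must lie in [0, 1]")
    # Round the miss count up so the bound stays safe.
    max_misses = min(num_accesses, math.ceil(num_accesses * miss_rate_bound))
    return num_accesses * hit_latency + max_misses * miss_penalty


# Hypothetical kernel: 1000 LLC accesses, 20-cycle hit, 200-cycle miss penalty.
loose = llc_time_bound(1000, 0.50, hit_latency=20, miss_penalty=200)
tight = llc_time_bound(1000, 0.25, hit_latency=20, miss_penalty=200)
print(loose, tight)  # prints: 120000 70000
```

Under this model, lowering the miss rate bound from 0.50 to 0.25 cuts the memory component of the estimate from 120000 to 70000 cycles, which is the sense in which a more accurate miss rate estimate can reduce pessimism in the WCET estimate.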
