Real-Time GPU Computing: Cache or No Cache?

Recent Graphics Processing Units (GPUs) employ cache memories to boost performance. However, cache memories are well known to harm time predictability on CPUs. For high-performance real-time systems that use GPUs, it remains unknown whether cache memories should be employed. In this paper, we quantitatively compare the performance of GPUs with and without caches, and find that GPUs without the cache actually achieve better average-case performance and higher time predictability. However, we also study a profiling-based cache bypassing method, which uses the L1 data cache more efficiently and achieves better average-case performance than GPUs without the cache. Therefore, employing caches still appears beneficial for real-time computing on GPUs.
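The abstract does not describe how the profiling-based bypassing decision is made, so the following is only a minimal sketch of the general idea, under stated assumptions: a profiling run records L1 data cache hits and misses per load instruction, and loads whose hit rate falls below a cutoff are marked to bypass the L1. The names `LoadProfile` and `select_bypass_loads` and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class LoadProfile:
    """Profiled L1 data cache behavior of one load instruction (hypothetical format)."""
    pc: int       # address of the load instruction
    hits: int     # L1 hits observed during the profiling run
    misses: int   # L1 misses observed during the profiling run


def select_bypass_loads(profiles, threshold=0.5):
    """Return the PCs of loads whose profiled L1 hit rate is below `threshold`.

    The threshold is an assumed tuning knob, not a value from the paper.
    Loads that never executed during profiling are left cached.
    """
    bypass = set()
    for p in profiles:
        total = p.hits + p.misses
        if total == 0:
            continue  # no profile data, so keep the default (cached) behavior
        if p.hits / total < threshold:
            bypass.add(p.pc)
    return bypass


if __name__ == "__main__":
    profiles = [
        LoadProfile(pc=0x10, hits=90, misses=10),  # good reuse: keep in L1
        LoadProfile(pc=0x18, hits=5, misses=95),   # streaming access: bypass
        LoadProfile(pc=0x20, hits=0, misses=0),    # never executed: keep default
    ]
    print(sorted(hex(pc) for pc in select_bypass_loads(profiles)))
```

In such a scheme, the selected loads would then be issued so that they skip the L1 data cache, which leaves its capacity for the accesses that actually exhibit reuse.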
