Cache locking vs. partitioning for real-time computing on integrated CPU-GPU processors

Heterogeneous multicore processors that integrate CPU and GPU cores on the same chip pose new challenges and opportunities for time-predictable resource sharing, which is crucial for hard real-time and safety-critical systems. Because CPU and GPU (graphics processing unit) accesses to the shared last-level cache (LLC) have diverse patterns and characteristics, they can interfere heavily with one another, degrading the performance and time predictability of both CPUs and GPUs. In this paper, we explore cache partitioning, cache locking, and a combination of the two to make the LLC time-predictable for integrated CPU-GPU processors while achieving high performance. By evaluating these LLC management approaches, we provide real-time system developers with recommendations on the most effective time-predictable LLC designs for heterogeneous CPU-GPU multicore processors.
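
To make the two mechanisms concrete, the following is a minimal toy sketch of a set-associative LLC that combines way partitioning with line locking. It is not the simulator or the policy evaluated in the paper: the class name `PartitionedLLC`, the `way_owner` parameter, the LRU replacement choice, and the rule that a hit may occur in any way while allocation is restricted to the requester's own unlocked ways are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Line:
    tag: Optional[int] = None   # None means the line is empty
    locked: bool = False        # locked lines are never chosen as victims
    last_use: int = 0           # timestamp used for LRU replacement


class PartitionedLLC:
    """Toy shared LLC: way partitioning plus line locking (illustrative only)."""

    def __init__(self, num_sets, way_owner):
        # way_owner[i] names the requester ("cpu" or "gpu") that may
        # allocate into way i of every set (hypothetical interface).
        self.num_sets = num_sets
        self.way_owner = list(way_owner)
        self.sets = [[Line() for _ in self.way_owner] for _ in range(num_sets)]
        self.clock = 0

    def _find(self, block):
        tag = block // self.num_sets
        for line in self.sets[block % self.num_sets]:
            if line.tag == tag:
                return line
        return None

    def access(self, requester, block):
        """Return True on a hit, False on a miss."""
        self.clock += 1
        line = self._find(block)
        if line is not None:
            line.last_use = self.clock
            return True
        # Miss: pick the LRU victim among the requester's own unlocked ways.
        cache_set = self.sets[block % self.num_sets]
        candidates = [
            l for l, owner in zip(cache_set, self.way_owner)
            if owner == requester and not l.locked
        ]
        if candidates:
            victim = min(candidates, key=lambda l: l.last_use)
            victim.tag = block // self.num_sets
            victim.last_use = self.clock
        # With no eligible way, the block simply bypasses the cache.
        return False

    def lock(self, requester, block):
        """Bring a block into the cache (if needed) and pin it there."""
        self.access(requester, block)
        line = self._find(block)
        if line is None:
            return False
        line.locked = True
        return True


if __name__ == "__main__":
    llc = PartitionedLLC(num_sets=1, way_owner=["cpu", "cpu", "gpu", "gpu"])
    llc.access("cpu", 0)
    for b in range(100, 110):          # streaming GPU traffic
        llc.access("gpu", b)
    print(llc.access("cpu", 0))        # CPU block survives the GPU stream
```

In this sketch, partitioning keeps streaming GPU traffic from evicting CPU blocks, and locking additionally protects a chosen block from eviction by its own owner, at the cost of shrinking the capacity left for unlocked data.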
