A hybrid approach to cache management in heterogeneous CPU-FPGA platforms

Heterogeneous computing is attracting growing attention because it promises high performance at low power. CPU-FPGA platforms built around a shared coherent cache, such as Intel HARP, are a particularly promising example, combining improved efficiency with high flexibility. In this work, we propose a hybrid strategy that combines static analysis of applications with dynamic cache control driven by that analysis, in order to minimize contention on the FPGA cache in emerging CPU-FPGA platforms with shared coherent caches. In the static analysis, we characterize the memory access patterns of the FPGA-accelerated kernels using reuse distance theory and generate kernel characteristics called Key values. A dynamic scheme for cache bypassing and partitioning control, based on these Key values, is then developed to increase the cache hit rate and improve performance. We validate the proposed strategy using a system-level architectural simulator for CPU-FPGA heterogeneous computing systems. Experiments show that the proposed strategy increases the cache hit rate by 22.90% on average and speeds up the application by up to 12.52% with negligible area overhead.
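
For intuition about the static analysis step, the reuse distance of an access is the number of distinct addresses touched since the previous access to the same address. The sketch below is a minimal illustration of that definition only; it is not the paper's analysis, and it does not compute the Key values, whose definition is not given here. The function names, the toy trace, and the fully associative LRU assumption are all hypothetical choices made for the example.

```python
def reuse_distances(trace):
    """Return the reuse distance of each access in `trace`.

    The reuse distance is the number of distinct addresses accessed
    between two consecutive accesses to the same address. First-time
    accesses have no previous use and are reported as None.
    """
    stack = []       # LRU stack: most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            idx = stack.index(addr)
            # Everything above `addr` in the stack was touched since its last use.
            distances.append(len(stack) - 1 - idx)
            stack.pop(idx)
        else:
            distances.append(None)
        stack.append(addr)
    return distances


def lru_hit_rate(trace, capacity):
    """Hit rate of a fully associative LRU cache holding `capacity` lines.

    Assumption for illustration: an access hits exactly when its reuse
    distance is smaller than the cache capacity.
    """
    hits = sum(1 for d in reuse_distances(trace) if d is not None and d < capacity)
    return hits / len(trace)


# Hypothetical toy trace of cache-line addresses.
trace = ["a", "b", "c", "a", "b", "b"]
print(reuse_distances(trace))   # [None, None, None, 2, 2, 0]
print(lru_hit_rate(trace, 3))   # 0.5
```

Under these assumptions, a kernel whose reuse distances mostly exceed the available cache capacity gains little from caching, which is the kind of observation that can motivate bypassing or a smaller partition.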
