CAMAS: Static and Dynamic Hybrid Cache Management for CPU-FPGA Platforms

Heterogeneous computing offers a way to keep pace with the growing demands of modern computing tasks. The CPU-FPGA platform is a promising candidate because the flexibility of the FPGA allows customization for a wide range of computing tasks, improving both performance and energy efficiency. CPU-FPGA systems built on a shared coherent cache (such as Intel HARP and IBM POWER8 with CAPI) have been proposed to improve communication efficiency between the CPU and the FPGA and to simplify the programming model. In such systems, a coherent cache is attached to the FPGA to provide fast memory access, and its behavior dominates the performance of the FPGA and of the entire system. However, FPGA execution tends to suffer a high miss rate in the FPGA cache, which erodes the benefit of FPGA acceleration. To address this problem, we propose CAMAS, a coordinated static and dynamic cache management approach that reduces FPGA cache misses and improves the performance of the accelerator functional unit (AFU). In the static step, reuse distance analysis is applied to the FPGA memory access trace at compile time to classify the accessed cachelines into three types according to their level of locality. At run time, a dynamic controller equipped with a learning mechanism then decides, on each cache miss, whether to cache or bypass the returned cacheline according to its type. Experiments on Polybench applications show an average performance improvement of 24.92% with CAMAS.
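To illustrate the static step, the following is a minimal sketch of reuse distance analysis on a cacheline access trace, followed by a three-way classification by locality. It is not the CAMAS implementation: the function names, the class labels ("high", "medium", "low"), and the 0.75/0.25 thresholds are assumptions chosen for illustration, and the abstract does not specify how the three types are actually defined.

```python
from collections import OrderedDict, defaultdict


def reuse_distances(trace):
    """Return, for each access, the number of distinct cachelines touched
    since the previous access to the same cacheline (None on first access)."""
    stack = OrderedDict()  # LRU stack: most recently used cacheline is last
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            # Distinct lines accessed after the previous use of `line`.
            distances.append(len(keys) - keys.index(line) - 1)
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def classify_lines(trace, cache_lines, hi=0.75, lo=0.25):
    """Classify each cacheline by locality (hypothetical scheme).

    A reuse counts as a likely hit when its reuse distance is smaller than
    `cache_lines`, the capacity of a fully associative LRU cache. The
    thresholds `hi` and `lo` are illustrative assumptions.
    """
    per_line = defaultdict(list)
    for line, dist in zip(trace, reuse_distances(trace)):
        per_line[line]  # make sure never-reused lines are also recorded
        if dist is not None:
            per_line[line].append(dist)

    classes = {}
    for line, dists in per_line.items():
        if not dists:
            classes[line] = "low"  # never reused: bypass candidate
            continue
        hit_fraction = sum(d < cache_lines for d in dists) / len(dists)
        if hit_fraction >= hi:
            classes[line] = "high"    # cache candidate
        elif hit_fraction <= lo:
            classes[line] = "low"     # bypass candidate
        else:
            classes[line] = "medium"  # left to the run-time decision
    return classes


if __name__ == "__main__":
    trace = [7, 7, 1, 2, 3, 4, 5, 7]
    print(reuse_distances(trace))
    print(classify_lines(trace, cache_lines=4))
```

In this sketch, lines labeled "high" would be cached and lines labeled "low" bypassed, while "medium" lines are the ones a run-time mechanism would have to decide on. The quadratic stack search keeps the example short; a production analysis would typically use a tree-based structure to compute reuse distances on long traces.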
