CAMO: A novel cache management organization for GPGPUs

GPGPUs are now commonly used alongside CPUs as co-processors for data-parallel, throughput-intensive algorithms. However, the memory available on a GPGPU is insufficient for many applications of interest, and the memory demands of such applications continue to grow. Techniques such as multi-streaming and pinned memory are frequently employed to mitigate this limitation to some extent, but they either suffer from latency overhead or increase programming complexity. GPUdmm instead uses GPU DRAM as a cache of CPU memory; the key problems in that design are an inefficient memory-access data path and the overhead of tag accesses. In this context, we present CAMO, a novel cache memory organization for GPGPUs that addresses the limitations of both the pinned-memory technique and GPUdmm. First, CAMO uses GPU DRAM as a victim cache of the LLC, improving performance by delivering data to the SMs faster. Second, it adopts ATCache, a DRAM-cache tag-management technique originally proposed for CPUs, which reduces the number of DRAM cache accesses required for tag lookups. We implement CAMO in the GPGPU-Sim framework and show that, compared with pinned memory, it improves performance by 1.87x on average and by up to 4.67x at peak. In addition, CAMO outperforms GPUdmm by 15.9% on average, with a maximum speedup of 80%.
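To make the tag-management idea concrete, the following minimal C++ sketch (our illustration, not the authors' implementation; all class and function names are hypothetical) models an ATCache-style SRAM tag cache placed in front of a DRAM-cache tag array. A request first probes the small SRAM structure; only on an SRAM miss does it pay a DRAM tag access, which is the per-request overhead a plain DRAM cache design such as GPUdmm's would incur on every lookup.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One DRAM-cache tag entry: records which memory block occupies a line.
    struct TagEntry {
        uint64_t tag;
        bool valid;
    };

    // Direct-mapped SRAM cache over the DRAM-cache tag array (ATCache idea).
    class SramTagCache {
    public:
        explicit SramTagCache(size_t slots) : entries_(slots) {}

        // True when the tag for this DRAM-cache set is resident in SRAM,
        // so the hit/miss decision needs no DRAM tag read.
        bool probe(uint64_t set, TagEntry& out) const {
            const Slot& s = entries_[set % entries_.size()];
            if (!s.present || s.set != set) return false;
            out = s.entry;
            return true;
        }

        // Install a tag fetched from the DRAM tag array after an SRAM miss.
        void fill(uint64_t set, const TagEntry& e) {
            Slot& s = entries_[set % entries_.size()];
            s.set = set;
            s.entry = e;
            s.present = true;
        }

    private:
        struct Slot {
            uint64_t set = 0;
            TagEntry entry{0, false};
            bool present = false;
        };
        std::vector<Slot> entries_;
    };

    // Request path: an SRAM tag hit skips the per-access DRAM tag read.
    bool dramCacheLookup(SramTagCache& tc, const std::vector<TagEntry>& dramTags,
                         uint64_t addr, uint64_t lineBytes, uint64_t& dramTagReads) {
        uint64_t block = addr / lineBytes;
        uint64_t set = block % dramTags.size();
        TagEntry e{0, false};
        if (!tc.probe(set, e)) {        // SRAM miss: one DRAM tag access
            ++dramTagReads;
            e = dramTags[set];
            tc.fill(set, e);
        }
        return e.valid && e.tag == block;
    }

    int main() {
        std::vector<TagEntry> dramTags(1 << 16, TagEntry{0, false});
        SramTagCache tc(1 << 10);       // small SRAM tag cache
        uint64_t dramTagReads = 0;

        // Mark one block as resident in the DRAM cache, then touch it twice:
        // the second lookup resolves entirely from the SRAM tag cache.
        uint64_t addr = 0xABC000, lineBytes = 128;
        uint64_t block = addr / lineBytes;
        dramTags[block % dramTags.size()] = TagEntry{block, true};

        dramCacheLookup(tc, dramTags, addr, lineBytes, dramTagReads);
        dramCacheLookup(tc, dramTags, addr, lineBytes, dramTagReads);
        std::printf("DRAM tag reads: %llu\n",
                    (unsigned long long)dramTagReads);
        return 0;
    }

The capacities, line size, and direct-mapped organization above are illustrative assumptions; the point of the sketch is only the control flow in dramCacheLookup, where a resident SRAM tag entry avoids the extra DRAM access entirely.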

[1] Rudolf Eigenmann, et al. OpenMP to GPGPU: a compiler framework for automatic translation and optimization, 2009, PPoPP '09.

[2] J. Thomas Pawlowski, et al. Hybrid memory cube (HMC), 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[3] Wen-mei W. Hwu, et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, 2012.

[4] John E. Stone, et al. An asymmetric distributed shared memory model for heterogeneous parallel systems, 2010, ASPLOS XV.

[5] Kevin M. Lepak, et al. Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor, 2010, IEEE Micro.

[6] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[7] Jangwoo Kim, et al. A fully associative, tagless DRAM cache, 2015.

[8] Ram Huggahalli, et al. Direct cache access for high bandwidth network I/O, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[9] Jaewon Lee, et al. GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[10] R. Manikantan, et al. Bi-Modal DRAM Cache: A Scalable and Effective Die-Stacked DRAM Cache, 2014, MICRO 2014.

[11] Emmett Kilgariff, et al. Fermi GF100 GPU Architecture, 2011, IEEE Micro.

[12] Xin Bi, et al. High bandwidth memory interface design based on DDR3 SDRAM and FPGA, 2015, 2015 International SoC Design Conference (ISOCC).

[13] Gabriel H. Loh, et al. A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[14] Li Zhao, et al. Exploring DRAM cache architectures for CMP server platforms, 2007, 2007 25th International Conference on Computer Design.

[15] R. Govindarajan, et al. Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[16] Babak Falsafi, et al. Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache, 2013, ISCA.

[17] Mark D. Hill, et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches, 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18] Tarek S. Abdelrahman, et al. hiCUDA: a high-level directive-based language for GPU programming, 2009, GPGPU-2.

[19] Cheng-Chieh Huang, et al. ATCache: Reducing DRAM cache latency via a small SRAM tag cache, 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[20] Collin McCurdy, et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite, 2010, GPGPU-3.

[21] Mike O'Connor, et al. Cache-Conscious Wavefront Scheduling, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[22] Babak Falsafi, et al. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[23] Yan Solihin, et al. CHOP: Integrating DRAM Caches for CMP Server Platforms, 2011, IEEE Micro.

[24] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).