Exploring the Relation between Monolithic 3D L1 GPU Cache Capacity and Warp Scheduling Efficiency

The warp scheduler plays an important role in using GPU hardware resources efficiently. However, its efficiency is often limited by L1 cache capacity, especially that of the L1 data cache, and enlarging an L1 cache is challenging because a larger cache has higher access latency. In this paper, we adopt Monolithic 3D (M3D) technology to design a large-capacity L1 cache that enhances GPU performance without degrading latency. Our evaluation results show that the M3D L1 cache improves GPU performance by 2.18× to 2.24× on average compared to the conventional 2D L1 cache.
