Can PCM Benefit GPU? Reconciling Hybrid Memory Design with GPU Massive Parallelism for Energy Efficiency

In recent studies, phase change memory (PCM) has shown promising energy efficiency for systems with a modest level of parallelism, but it remains an open question whether it can benefit massively parallel systems such as GPUs. This work conducts the first systematic investigation into this question. It empirically shows that, contrary to the promising results previously reported on CPUs, prior designs of PCM-based memory significantly degrade the energy efficiency of GPU computing. It reveals that the fundamental reason is a mismatch between those designs and the massive parallelism of GPUs, and that fixing the mismatch requires innovations in both hardware and software. It introduces a set of new hardware features and a novel compiler-directed data placement scheme to address the issues. Working hand in hand, they tap into the full potential of hybrid memory for GPUs, yielding 15.6% and 40.1% energy savings on average compared to pure DRAM and pure PCM, respectively, while keeping the performance loss below 3.9%.
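To make the placement idea concrete, the sketch below shows one plausible shape a compiler-directed data placement pass could take for a hybrid DRAM/PCM memory: profile each array's read and write counts, then greedily give DRAM to the arrays that save the most energy per byte by avoiding PCM writes, subject to a DRAM capacity budget. This is a minimal illustration under assumed per-access energy costs; the names, constants, and greedy heuristic are all hypothetical and are not the scheme proposed in the paper.

```python
# Hypothetical sketch of write-intensity-based data placement for a hybrid
# DRAM/PCM memory. All names and energy constants below are illustrative
# assumptions, not the paper's actual design.

from dataclasses import dataclass

# Assumed per-access energy costs (arbitrary units): PCM reads are slightly
# cheaper than DRAM, but PCM writes are far more expensive.
DRAM_READ, DRAM_WRITE = 1.0, 1.0
PCM_READ, PCM_WRITE = 0.8, 6.0

@dataclass
class ArrayProfile:
    name: str
    reads: int    # profiled read accesses
    writes: int   # profiled write accesses
    size: int     # footprint in bytes

def place(arrays, dram_budget):
    """Greedily assign each array to DRAM or PCM.

    Arrays with the largest projected energy saving per byte from DRAM
    placement get DRAM first, until the DRAM budget is exhausted.
    """
    def dram_benefit(a):
        pcm_cost = a.reads * PCM_READ + a.writes * PCM_WRITE
        dram_cost = a.reads * DRAM_READ + a.writes * DRAM_WRITE
        return (pcm_cost - dram_cost) / a.size  # saving per byte

    placement, used = {}, 0
    for a in sorted(arrays, key=dram_benefit, reverse=True):
        if dram_benefit(a) > 0 and used + a.size <= dram_budget:
            placement[a.name] = "DRAM"
            used += a.size
        else:
            placement[a.name] = "PCM"  # read-dominated data suits PCM
    return placement

if __name__ == "__main__":
    profiles = [
        ArrayProfile("input_matrix",  reads=10_000_000, writes=0,          size=64 << 20),
        ArrayProfile("output_matrix", reads=0,          writes=10_000_000, size=64 << 20),
        ArrayProfile("scratch",       reads=500_000,    writes=500_000,    size=16 << 20),
    ]
    print(place(profiles, dram_budget=80 << 20))
```

Under these assumed costs, the write-only output array and the mixed-use scratch array land in DRAM while the read-only input lands in PCM, which matches the intuition that shielding PCM from writes is where the energy savings come from. A real compiler pass would additionally have to reason about GPU-specific effects such as coalescing and the aggregate write bandwidth of thousands of concurrent threads, which the abstract identifies as the crux of the mismatch.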
