Exploring Memory Persistency Models for GPUs

Given its high integration density, high speed, byte addressability, and low standby power, non-volatile (persistent) memory is expected to supplement or replace DRAM as main memory. Using a persistency programming model, which defines the durability ordering of stores, together with durable transaction constructs, the programmer can build recoverable data structures (RDS) that allow programs to recover to a consistent state after a failure. While persistency models have been well studied for CPUs, they have been neglected for graphics processing units (GPUs). Considering the importance of GPUs as a dominant accelerator for high-performance computing, we investigate persistency models for GPUs. GPU applications differ substantially from CPU applications; hence, in this paper we adapt, re-architect, and optimize CPU persistency models for GPUs. We design a pragma-based compiler scheme for expressing persistency models for GPUs. We identify that the thread hierarchy in GPUs offers intuitive scopes in which to form epochs and durable transactions. We find that undo logging produces significant performance overheads, and we propose using idempotency analysis to reduce both the logging frequency and the size of the logs. Through both real-system and simulation evaluations, we show that our proposed architecture support incurs low overhead.
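
As a rough illustration of why idempotency can reduce undo logging, the following Python sketch models persistent memory as a dictionary with an undo log. It is a toy under stated assumptions, not the paper's compiler analysis or GPU implementation: the names `PersistentMemory`, `logged_store`, `plain_store`, `is_idempotent`, `increment_all`, and `vector_add` are all hypothetical, and the "reads and writes are disjoint" test is a deliberately conservative stand-in for a real idempotency analysis.

```python
class PersistentMemory:
    """Toy model of persistent memory: address -> value, plus an undo log."""

    def __init__(self, data):
        self.data = dict(data)
        self.undo_log = []    # (address, old_value) pairs, oldest first
        self.log_writes = 0   # how many undo-log entries were ever written

    def logged_store(self, addr, value):
        # Undo logging: record the old value before overwriting it.
        self.undo_log.append((addr, self.data.get(addr)))
        self.log_writes += 1
        self.data[addr] = value

    def plain_store(self, addr, value):
        # Store without logging; only safe inside an idempotent region.
        self.data[addr] = value

    def commit(self):
        # The transaction is durable, so the log is no longer needed.
        self.undo_log.clear()

    def recover(self):
        # Roll back an uncommitted transaction, newest entry first.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            if old is None:
                self.data.pop(addr, None)
            else:
                self.data[addr] = old


def is_idempotent(reads, writes):
    # Assumed simplification: a region that never overwrites its own
    # inputs can be re-executed from the start after a failure.
    return not (set(reads) & set(writes))


def increment_all(mem, addrs, crash_after=None):
    """Non-idempotent region (a[i] = a[i] + 1): every store is logged."""
    for n, addr in enumerate(addrs):
        if crash_after is not None and n == crash_after:
            return False  # simulated failure before commit
        mem.logged_store(addr, mem.data[addr] + 1)
    mem.commit()
    return True


def vector_add(mem, n, crash_after=None):
    """Idempotent region (out[i] = a[i] + b[i]): no logging required."""
    for i in range(n):
        if crash_after is not None and i == crash_after:
            return False  # simulated failure; recovery simply re-executes
        mem.plain_store(("out", i), mem.data[("a", i)] + mem.data[("b", i)])
    return True
```

In this sketch, a failure inside `increment_all` is repaired by calling `recover()`, whereas a failure inside `vector_add` is repaired by calling `vector_add` again, which writes no log entries at all.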
