Efficiently enforcing strong memory ordering in GPUs

GPU programming models such as CUDA and OpenCL are beginning to adopt a weak data-race-free (DRF-0) memory model, which guarantees no semantics for programs that contain data races. Before the memory model interface for GPUs is standardized, it is imperative to understand the tradeoffs among different memory models for these devices. Although the memory model literature for CPUs is rich, studies of the architectural mechanisms and performance costs of enforcing memory ordering constraints in GPU accelerators have been lacking. This paper shows that, for most GPGPU applications, the performance cost of sequential consistency (SC) and total store order (TSO) relative to DRF-0 is insignificant, owing to warp-level parallelism and in-order execution. For the remaining, more challenging applications that exhibit significant overhead under SC, we show that the memory ordering optimizations commonly employed in CPUs are either expensive or ineffective on GPUs. We propose a GPU-specific, non-speculative SC design that takes advantage of the high spatial locality and temporally private data found in GPU applications. Results show that the proposed design effectively eliminates the performance gap between SC and DRF-0 on GPUs.
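To make the difference between the memory models named above concrete, the following is a minimal sketch of the classic store-buffering litmus test. It is not taken from the paper: the `THREADS` table, the `outcomes` function, and the per-thread "flush" event are all illustrative assumptions, and the store-buffer model is a toy approximation of TSO rather than a description of any real GPU. Each thread stores 1 to its own variable and then loads the other thread's variable; the sketch enumerates every legal interleaving and collects the observable register values.

```python
from itertools import permutations

# Hypothetical two-thread litmus test (store buffering):
#   thread 0: x = 1; r0 = y        thread 1: y = 1; r1 = x
THREADS = {0: ("x", "y"), 1: ("y", "x")}  # tid -> (stored var, loaded var)


def outcomes(store_buffers):
    """Return the set of (r0, r1) results over all legal interleavings.

    store_buffers=False: stores reach memory immediately (SC-like).
    store_buffers=True: a store only enters a private buffer and reaches
    memory at a later 'flush' event (a toy TSO-like model, an assumption).
    """
    events = [(t, kind) for t in THREADS for kind in ("store", "load")]
    if store_buffers:
        events += [(t, "flush") for t in THREADS]

    seen = set()
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        # Program order: each thread issues its store before its load.
        if any(pos[(t, "store")] > pos[(t, "load")] for t in THREADS):
            continue
        # A buffered store can only be flushed after it was issued.
        if store_buffers and any(
            pos[(t, "store")] > pos[(t, "flush")] for t in THREADS
        ):
            continue

        mem = {"x": 0, "y": 0}
        regs = {}
        for t, kind in order:
            stored_var, loaded_var = THREADS[t]
            if kind == "store" and not store_buffers:
                mem[stored_var] = 1
            elif kind == "flush":
                mem[stored_var] = 1
            elif kind == "load":
                # Loads target the other thread's variable, so no
                # store-to-load forwarding is needed in this test.
                regs[t] = mem[loaded_var]
        seen.add((regs[0], regs[1]))
    return seen


if __name__ == "__main__":
    print("SC-like outcomes: ", sorted(outcomes(store_buffers=False)))
    print("TSO-like outcomes:", sorted(outcomes(store_buffers=True)))
```

In this toy model the result `(0, 0)` never appears without store buffers but does appear with them, which is the ordering relaxation that separates SC from TSO. Under DRF-0 the program as written contains data races on `x` and `y`, so the model assigns it no semantics at all.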
