Efficient GPU hardware transactional memory through early conflict resolution

It has been proposed that Transactional Memory be added to Graphics Processing Units (GPUs) in recent years. One proposed hardware design, Warp TM, can scale to 1000s of concurrent transactions. As a programming method that can atomicize an arbitrary number of memory access locations and greatly reduce the efforts to program parallel applications, transactional memory handles the complexity of inter-thread synchronization. However, when thousands of transactions run concurrently on a GPU, conflicts and resource contentions arise, causing performance loss. In this paper, we identify and analyze the cause of conflicts and contentions and propose two enhancements that try to resolve conflicts early: (1) Early-Abort global conflict resolution that allows conflicts to be detected before they reach the Commit Units so that contention in the Commit Units is reduced and (2) Pause-and-Go execution scheme that reduces the chance of conflict and the performance penalty of re-executing long transactions. These two enhancements are enabled by a single hardware modification. Our evaluation shows the combination of the two enhancements greatly improves overall execution speed while reducing energy consumption.

[1]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[2]  Maurice Herlihy,et al.  Software transactional memory for dynamic-sized data structures , 2003, PODC '03.

[3]  Michael L. Scott,et al.  Conflict Reduction in Hardware Transactions Using Advisory Locks , 2015, SPAA.

[4]  Maurice Herlihy,et al.  A flexible framework for implementing software transactional memory , 2006, OOPSLA '06.

[5]  Mateo Valero,et al.  TagTM - accelerating STMs with hardware tags for fast meta-data access , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[6]  David A. Wood,et al.  Performance Pathologies in Hardware Transactional Memory , 2007, IEEE Micro.

[7]  J. Xu OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[8]  Dan Wu,et al.  Lowering the Overhead of Hybrid Transactional Memory with Transact Cache , 2008, 2008 The 9th International Conference for Young Computer Scientists.

[9]  Tor M. Aamodt,et al.  Energy efficient GPU transactional memory via space-time optimizations , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[10]  Helen Cameron,et al.  Lock-Free Red-Black Trees Using CAS , 2011 .

[11]  David Eisenstat,et al.  Lowering the Overhead of Nonblocking Software Transactional Memory , 2006 .

[12]  Antonia Zhai,et al.  Lightweight Software Transactions on GPUs , 2014, 2014 43rd International Conference on Parallel Processing.

[13]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[14]  Maged M. Michael,et al.  RingSTM: scalable transactions with a single atomic instruction , 2008, SPAA '08.

[15]  Prabhakar Misra,et al.  Performance Evaluation of Concurrent Lock-free Data Structures on GPUs , 2012, 2012 IEEE 18th International Conference on Parallel and Distributed Systems.

[16]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[17]  Michael L. Scott,et al.  Flexible Decoupled Transactional Memory Support , 2008, 2008 International Symposium on Computer Architecture.

[18]  Andrew Brownsword,et al.  Hardware transactional memory for GPU architectures , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[19]  Per Stenström,et al.  Classification and Elimination of Conflicts in Hardware Transactional Memory Systems , 2011, 2011 23rd International Symposium on Computer Architecture and High Performance Computing.

[20]  Depei Qian,et al.  Software Transactional Memory for GPU Architectures , 2014, IEEE Computer Architecture Letters.