Improvements in Hardware Transactional Memory for GPU Architectures

In the multi-core CPU world, transactional memory (TM) has emerged as an alternative to lock-based programming for thread synchronization. Recent research proposes the use of TM in GPU architectures, where a high number of computing threads, organized in SIMT fashion, requires an effective synchronization method. In contrast to CPUs, GPUs offer two memory spaces: global memory and local memory. The local memory space serves as a shared scratch-pad for a subset of the computing threads, and it is used by programmers to speed-up their applications thanks to its low latency. Prior work from the authors proposed a lightweight hardware TM (HTM) support based in the local memory, modifying the SIMT execution model and adding a conflict detection mechanism. An efficient implementation of these features is key in order to provide an effective synchronization mechanism at the local memory level. After a quick description of the main features of our HTM design for GPU local memory, in this work we gather together a number of proposals designed with the aim of improving those mechanisms with high impact on performance. Firstly, the SIMT execution model is modified to increase the parallelism of the application when transactions must be serialized in order to make forward progress. Secondly, the conflict detection mechanism is optimized depending on application characteristics, such us the read/write sets, the probability of conflict between transactions and the existence of read-only transactions. As these features can be present in hardware simultaneously, it is a task of the compiler and runtime to determine which ones are more important for a given application. This work includes a discussion on the analysis to be done in order to choose the best configuration solution.

[1]  Antonia Zhai,et al.  Lightweight Software Transactions on GPUs , 2014, 2014 43rd International Conference on Parallel Processing.

[2]  Jeffrey T. Draper,et al.  Improving Utilization of Hardware Signatures in Transactional Memory , 2013, IEEE Transactions on Parallel and Distributed Systems.

[3]  Ken R. Duffy,et al.  Complexity analysis of a decentralised graph colouring algorithm , 2008, Inf. Process. Lett..

[4]  David R. Kaeli,et al.  Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[5]  Andrew Brownsword,et al.  Hardware transactional memory for GPU architectures , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[7]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[8]  Tor M. Aamodt,et al.  Energy efficient GPU transactional memory via space-time optimizations , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9]  David Kaeli,et al.  Hardware support for Local Memory Transactions on GPU Architectures , 2015 .

[10]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[11]  Philippas Tsigas,et al.  Towards a Software Transactional Memory for Graphics Processors , 2010, EGPGV@Eurographics.

[12]  James R. Larus,et al.  Transactional Memory, 2nd edition , 2010, Transactional Memory.

[13]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[14]  Depei Qian,et al.  Software Transactional Memory for GPU Architectures , 2014, IEEE Computer Architecture Letters.