BifurKTM: Approximately Consistent Distributed Transactional Memory for GPUs

We present BifurKTM (BkTM), the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput, and KoDTM, a Distributed Transactional Memory model that combines the Data-flow and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs see limited use because they are difficult to program and extremely sensitive to workload characteristics. These concerns become daunting in a distributed GPU cluster, where a programmer must design algorithms that hide communication latency by exploiting data regularity, high compute intensity, and similar properties. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions "appear" to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.

2012 ACM Subject Classification: Computer systems organization → Heterogeneous (hybrid) systems
