Lightweight Hardware Transactional Memory for GPU Scratchpad Memory

Graphics Processing Units (GPUs) have become the accelerator of choice for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction, Multiple Thread (SIMT) fashion. Using OpenCL terminology, GPUs offer a global memory space shared by all the threads in the GPU, as well as a local memory space shared by only a subset of the threads. Programmers can use local memory as a scratchpad to improve application performance, because its latency is lower than that of global memory. In the SIMT execution model, however, the locking mechanisms used to protect shared data limit scalability. To take full advantage of the lower latency that local memory affords, and to provide an efficient synchronization mechanism, we propose GPU-LocalTM, a lightweight and efficient hardware transactional memory (TM) for GPU local memory. To minimize the storage resources required for TM support, GPU-LocalTM allocates transactional metadata in existing memory resources. Additionally, GPU-LocalTM implements several conflict detection mechanisms, so that the mechanism can be matched to the characteristics of the application. For the workloads studied in our simulation-based evaluation, GPU-LocalTM provides speedups ranging from 1.1X to 100X over serialized critical sections.
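
To make the contrast between transactional execution and serialized critical sections concrete, the following is a minimal toy model in Python. It is not the GPU-LocalTM hardware design, whose details the abstract does not give. It assumes one hypothetical scheme: a per-word owner field stored alongside local memory, checked eagerly when a transaction first touches a word, with losing threads aborting and retrying in the next lockstep round. All names (`LocalMemory`, `try_acquire`, `run_histogram`) are illustrative.

```python
from dataclasses import dataclass, field

FREE = -1  # metadata value meaning "no transaction owns this word"


@dataclass
class LocalMemory:
    """Toy scratchpad: data words plus one owner field per word (assumed layout)."""
    words: list
    owner: list = field(init=False)

    def __post_init__(self):
        self.owner = [FREE] * len(self.words)


def try_acquire(mem, addr, tid):
    """Eager conflict detection: succeed only if the word is free or already ours."""
    if mem.owner[addr] in (FREE, tid):
        mem.owner[addr] = tid
        return True
    return False


def run_histogram(keys, num_bins):
    """Simulate one work-group where thread `tid` runs the transaction
    bins[keys[tid]] += 1. Threads advance in lockstep rounds; a thread that
    detects a conflict aborts and retries in the next round."""
    mem = LocalMemory([0] * num_bins)
    pending = list(range(len(keys)))
    rounds = 0
    aborts = 0
    while pending:
        rounds += 1
        winners, losers = [], []
        # Phase 1: every pending thread tries to claim the word it will update.
        for tid in pending:
            if try_acquire(mem, keys[tid], tid):
                winners.append(tid)
            else:
                losers.append(tid)
        # Phase 2: conflict-free transactions commit and release ownership.
        for tid in winners:
            addr = keys[tid]
            mem.words[addr] += 1
            mem.owner[addr] = FREE
        aborts += len(losers)
        pending = losers
    return mem.words, rounds, aborts


if __name__ == "__main__":
    bins, rounds, aborts = run_histogram([0, 1, 0, 2, 1, 0], num_bins=3)
    print(bins, rounds, aborts)
```

In this model, six threads that update three bins finish in three rounds, whereas a single serialized critical section would need six steps, one per thread. The number of rounds is set by the most contended word, which illustrates why the benefit of a transactional scheme depends on the conflict pattern of the application.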
