Efficient Inspected Critical Sections in Data-Parallel GPU Codes

Optimistic concurrency control and software transactional memories (STMs) rely on the assumption that conflicts are sparse. For data-parallel GPU codes with many or dynamic data dependences, a pessimistic, lock-based approach may be faster, if only GPUs offered hardware support for GPU-wide fine-grained synchronization. Instead, attempts to implement such synchronization in software on current GPUs run into deadlocks and livelocks.
