Non-blocking programming on multi-core graphics processors: (extended abstract)

This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures such as the CUDA graphics processors. We first design three memory access models that capture the fundamental features of these new mechanisms. We then prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that coalesced memory accesses can facilitate strong synchronization between the threads of multicore processors without any synchronization primitives other than reads and writes. Moreover, based on the intrinsic features of recent GPU architectures, we construct strong synchronization objects, such as wait-free and t-resilient read-modify-write objects, for a general model of recent GPU architectures that lacks strong hardware synchronization primitives such as test-and-set and compare-and-swap. Accesses to the wait-free objects have time complexity O(N), where N is the number of processes. Our results demonstrate that wait-free synchronization mechanisms, and hence wait-free programming, are possible on GPUs without strong synchronization primitives in hardware.
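To make the notion of synchronization power concrete, the following is a minimal sketch of the classic two-process consensus protocol built from an atomic write to two memory words plus ordinary reads. It is not the construction from this paper. It assumes, for illustration only, that a coalesced write can be treated as a single atomic step covering several words. The names `process`, `run`, and `all_outcomes`, the word names `r0`, `r1`, and `shared`, and the generator-based scheduler are all hypothetical. The simulation enumerates interleavings and checks that both processes always decide the same proposed value.

```python
from itertools import product

EMPTY = None  # initial content of every memory word


def process(i, mem, proposals, decisions):
    """Process i (0 or 1) as a generator; each yield ends one atomic step."""
    j = 1 - i
    # Step 1: ONE atomic step that writes two words (assumed multi-word write).
    mem[f"r{i}"] = i
    mem["shared"] = i
    yield
    # Step 2: read the other process's private word.
    other = mem[f"r{j}"]
    yield
    if other is EMPTY:
        # The other process had not written yet, so this process wrote first.
        decisions[i] = proposals[i]
        return
    # Step 3: both have written; the shared word names the later writer.
    last = mem["shared"]
    yield
    winner = j if last == i else i  # the earlier writer wins
    decisions[i] = proposals[winner]


def run(schedule, proposals):
    """Run both processes under a given schedule of process ids."""
    mem = {"r0": EMPTY, "r1": EMPTY, "shared": EMPTY}
    decisions = {}
    procs = [process(i, mem, proposals, decisions) for i in (0, 1)]
    for pid in schedule:
        try:
            next(procs[pid])
        except StopIteration:
            pass  # this process has already decided
    for p in procs:  # let any unfinished process run to completion
        for _ in p:
            pass
    return decisions


def all_outcomes(proposals):
    """Collect the decision pairs reached over all schedules of length 8."""
    outcomes = set()
    for schedule in product((0, 1), repeat=8):
        d = run(schedule, proposals)
        outcomes.add((d[0], d[1]))
    return outcomes
```

Calling `all_outcomes(("a", "b"))` collects every pair of decisions the two processes can reach. In this model the pairs always agree, which is the sense in which a multi-word atomic write is strictly stronger than single-word reads and writes.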
