Experimental Fault-Tolerant Synchronization for Reliable Computation on Graphics Processors

Graphics processors (GPUs) are emerging as a promising platform for highly parallel, compute-intensive, general-purpose computations, which usually need support for inter-process synchronization. Using traditional lock-based synchronization (e.g., mutual exclusion) makes the computation vulnerable to faults caused both by scientists’ inexperience and by hardware transient errors. It is notoriously difficult for scientists to avoid deadlocks when their computation needs to lock many objects concurrently, and hardware transient errors can cause a process that holds a lock to stop progressing (or crash). While such hardware transient errors are a non-issue for graphics processors used for graphics computation (e.g., an error in a single pixel may not be noticeable), this no longer holds for graphics processors used for scientific computation. Such scientific computation requires a fault-tolerant synchronization mechanism. However, most of the powerful GPUs aimed at high-performance computing (e.g., the NVIDIA Tesla series) do not support strong synchronization primitives like test-and-set and compare-and-swap, which are usually used to construct fault-tolerant synchronization mechanisms. This paper presents an experimental study of fault-tolerant synchronization mechanisms for NVIDIA’s Compute Unified Device Architecture (CUDA) without the need for strong synchronization primitives in hardware. We implement a lock-free synchronization mechanism that eliminates lock-related problems like deadlock and, moreover, can tolerate process crash-failure. We address the experimental issues that arise in the implementation of the mechanism and evaluate its performance on commodity NVIDIA GeForce 8800 graphics cards.