Reproducible floating-point atomic addition in data-parallel environment

Floating-point addition in a concurrent execution environment is known to be hazardous, because the result depends on the order in which the operations are performed. The problem arises in data-parallel execution environments such as GPUs, where achieving reproducibility with floating-point atomic addition is challenging. It stems from the rounding error or cancellation that can occur in each operation, combined with the lack of control over execution order. In this article we propose two solutions: work reassignment and fixed-point accumulation. Work reassignment consists of enforcing an execution order, which leads to weak reproducibility. Fixed-point accumulation avoids rounding errors altogether by using a long accumulator, and enables strong reproducibility.
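The order dependence and the two remedies can be pictured with a small sequential sketch. It is not the article's GPU implementation: the shuffled list merely stands in for the nondeterministic arrival order of atomic additions, and the names `naive_sum`, `ordered_sum`, `fixed_point_sum`, `to_fixed` and `SCALE_BITS` are hypothetical, chosen for this illustration. The fixed-point part assumes finite double-precision inputs, each of which is an exact integer multiple of 2**-1074, and uses a Python arbitrary-precision integer as the long accumulator.

```python
import random

# Assumption: inputs are finite IEEE 754 doubles, so each one is an exact
# integer multiple of 2**-1074 (the smallest subnormal).
SCALE_BITS = 1074


def to_fixed(x: float) -> int:
    """Convert a finite double to an exact integer count of 2**-1074 units."""
    num, den = x.as_integer_ratio()  # exact: x == num / den, den is a power of two
    return num * ((1 << SCALE_BITS) // den)


def naive_sum(values) -> float:
    """Plain floating-point accumulation: each addition may round."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def ordered_sum(tagged_values) -> float:
    """Sum (index, value) pairs in index order, whatever order they arrived in.

    Illustrates the idea of enforcing an execution order; the result is the
    same on every run but still depends on the order that was chosen.
    """
    return naive_sum(v for _, v in sorted(tagged_values))


def fixed_point_sum(values) -> float:
    """Accumulate exactly in a long integer, then round once at the end."""
    acc = 0
    for v in values:
        acc += to_fixed(v)  # integer addition is associative, so order is irrelevant
    # Single rounding step; raises OverflowError if the exact sum exceeds
    # the double-precision range.
    return acc / (1 << SCALE_BITS)


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16, 1.0]
    tagged = list(enumerate(values))
    rng = random.Random(0)
    for _ in range(5):
        rng.shuffle(tagged)  # stand-in for nondeterministic arrival order
        arrival = [v for _, v in tagged]
        print(naive_sum(arrival), ordered_sum(tagged), fixed_point_sum(arrival))
```

In this sketch the first column can change from one shuffle to the next, the second column is stable but reflects the rounding of the chosen order, and the third column is the same for every order because no rounding happens until the final conversion.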
