GPU-Accelerated Generation of Correctly Rounded Elementary Functions

The IEEE 754-2008 standard recommends correct rounding for some elementary functions. Achieving this requires solving the Table Maker’s Dilemma (TMD), which demands a huge amount of CPU computation time. In this article, we consider accelerating such computations, namely the Lefèvre algorithm, on graphics processing units (GPUs), which are massively parallel architectures with a partially single instruction, multiple data (SIMD) execution model. We first propose an analysis of the Lefèvre hard-to-round argument search using the concept of continued fractions. We then propose a new parallel search algorithm that is much more efficient on GPUs thanks to its more regular control flow. We also present an efficient hybrid CPU-GPU deployment of the generation of the polynomial approximations required in the Lefèvre algorithm. In the end, we obtain overall speedups of up to 53.4× on one GPU over a sequential CPU execution and up to 7.1× over a hex-core CPU, which enables a much faster solution of the TMD for the double-precision format.
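To give a feel for the number theory behind a continued-fraction analysis of this kind of search, the following minimal Python sketch illustrates two standard facts: the continued-fraction expansion of a rational slope, and the three-gap (three-distance) property of the points k·a mod 1. It is not the algorithm of the article. The function names `continued_fraction` and `gap_lengths` and the sample slope 13/64 are assumptions chosen for illustration only.

```python
from fractions import Fraction


def continued_fraction(x: Fraction) -> list:
    """Partial quotients of a non-negative rational x (Euclidean algorithm)."""
    terms = []
    while True:
        q = x.numerator // x.denominator  # integer part
        terms.append(q)
        x -= q
        if x == 0:
            return terms
        x = 1 / x  # continue with the reciprocal of the fractional part


def gap_lengths(a: Fraction, n: int) -> set:
    """Distinct gap lengths between the points k*a mod 1, k = 0..n-1, on the unit circle.

    Assumes the n points are distinct, i.e. n does not exceed the
    denominator of a when a is in lowest terms.
    """
    points = sorted((k * a) % 1 for k in range(n))
    gaps = {points[i + 1] - points[i] for i in range(n - 1)}
    gaps.add(1 - points[-1] + points[0])  # gap that wraps around from the last point to the first
    return gaps


if __name__ == "__main__":
    a = Fraction(13, 64)  # hypothetical slope, illustration only
    print("partial quotients:", continued_fraction(a))
    for n in (5, 10, 20, 40):
        print(n, "points ->", len(gap_lengths(a, n)), "distinct gap lengths")
```

Exact rational arithmetic is used so that equal gaps compare equal. With floating-point values, rounding noise would make the count of distinct lengths meaningless.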
