Paper Information - Low Latency Floating-Point Division and Square Root Unit

Digit-recurrence algorithms are widely used in current microprocessors to compute floating-point division and square root. These iterative algorithms offer a good trade-off between performance, area, and power. We present a floating-point division and square root unit that implements radix-64 floating-point division and radix-16 floating-point square root. To keep the implementation affordable, each radix-64 division iteration and each radix-16 square root iteration is built from simpler radix-4 iterations: three in division and two in square root. Speculation is used between consecutive radix-4 iterations to reduce the delay. A digit-recurrence implementation has three parts: initialization, digit iterations, and rounding. The digit-iteration part is the iterative one and reuses the same logic over several cycles. Division and square root partially share the initialization and rounding stages, whereas each operation has its own logic for the digit iterations. The result is a low-latency floating-point division and square root unit that requires 11, 6, and 4 cycles for double-, single-, and half-precision division with normalized operands and result, and 15, 8, and 5 cycles for square root. One or two additional cycles are needed when an operand or the result is subnormal.
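The composition of one radix-64 division iteration from three radix-4 iterations can be illustrated with a small software model. This is only a sketch under stated assumptions, not the unit described in the abstract: it assumes the common radix-4 digit set {-2, ..., 2}, operands given as significands in [1, 2), and an exact digit selection (rounding 4w/d to the nearest digit) in place of the truncated-estimate selection a hardware design would use. It does not model speculation, timing, or rounding. The function names `select_digit`, `radix4_step`, and `radix64_divide` are hypothetical.

```python
from fractions import Fraction


def select_digit(w, d):
    """Choose a radix-4 quotient digit in {-2, ..., 2}.

    Assumption: exact selection by rounding 4*w/d to the nearest integer
    and clamping. Real hardware selects from a few truncated bits instead.
    """
    q = round(4 * w / d)
    return max(-2, min(2, q))


def radix4_step(w, d):
    """One radix-4 recurrence step: w_next = 4*w - q*d."""
    q = select_digit(w, d)
    return q, 4 * w - q * d


def radix64_divide(x, d, cycles):
    """Approximate x/d for significands x, d in [1, 2).

    Each cycle chains three radix-4 steps, which together yield one
    signed radix-64 digit (16*q1 + 4*q2 + q3).
    Returns (quotient_estimate, final_remainder).
    """
    d = Fraction(d)
    w = Fraction(x) / 4          # initial scaling keeps |w| <= (2/3) * d
    acc = Fraction(0)
    weight = Fraction(1)
    for _ in range(cycles):
        digit64 = 0
        for _ in range(3):       # three radix-4 iterations per cycle
            q, w = radix4_step(w, d)
            digit64 = 4 * digit64 + q
        weight /= 64             # weight of this radix-64 digit
        acc += digit64 * weight
    return 4 * acc, w            # undo the initial division by 4


x, d = Fraction(3, 2), Fraction(5, 4)
estimate, remainder = radix64_divide(x, d, cycles=9)
print(float(estimate), float(x / d))
```

In this model the remainder stays within (2/3)·d after every step, so the estimate after n cycles differs from x/d by at most (8/3)·64^(-n). Each cycle therefore contributes 6 quotient bits, compared with 2 bits for a single radix-4 step.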

Javier D. Bruguera
