Area and performance tradeoffs in floating-point divide and square-root implementations

Floating-point divide and square-root operations are essential to many scientific and engineering applications, and are required in all computer systems that support the IEEE floating-point standard. Yet many current microprocessors provide only weak support for these operations. The latency and throughput of division are typically far inferior to those of floating-point addition and multiplication, and square-root performance is often even lower. This article argues the case for high-performance division and square root. It also explains the algorithms and implementations of the two primary techniques employed in microprocessor floating-point units, subtractive and multiplicative methods, together with their associated area/performance tradeoffs. Case studies of representative floating-point unit configurations are presented, supported by simulation results using a carefully selected benchmark, Givens rotation, to show the dynamic performance impact of the various implementation alternatives. The topology of the implementation is found to be an important performance factor. Multiplicative algorithms, such as the Newton-Raphson method and Goldschmidt's algorithm, can achieve low latencies. However, these implementations serialize multiply, divide, and square-root operations through a single pipeline, which can lead to low throughput. While this hardware sharing keeps the area of baseline implementations small, lower-latency versions require many times more area. For these reasons, multiplicative implementations are best suited to cases where subtractive methods are precluded by area constraints and modest performance on divide and square-root operations is tolerable. Subtractive algorithms, exemplified by radix-4 SRT and radix-16 SRT, can be made to execute in parallel with other floating-point operations.
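To make the multiplicative methods named above concrete, the following is a minimal software sketch of Newton-Raphson and Goldschmidt division. It is an illustration under stated assumptions, not the hardware designs studied in the article: the function names, the linear starting approximation (hardware typically uses a reciprocal lookup table), and the fixed iteration count are all choices made here, and neither routine performs IEEE-correct rounding.

```python
import math

def _seed(m):
    # Assumed starting estimate of 1/m for m in [0.5, 1): a linear fit
    # with relative error at most 1/17. Hardware would typically read
    # this from a ROM reciprocal table instead.
    return 48.0 / 17.0 - (32.0 / 17.0) * m

def newton_raphson_divide(n, d, iterations=4):
    """Approximate n / d using only multiplies and subtracts."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    m, e = math.frexp(abs(d))        # abs(d) = m * 2**e with 0.5 <= m < 1
    x = _seed(m)
    for _ in range(iterations):
        x = x * (2.0 - m * x)        # relative error squares each pass
    q = math.ldexp(n * x, -e)        # undo the scaling of the divisor
    return -q if d < 0 else q

def goldschmidt_divide(n, d, iterations=4):
    """Approximate n / d by driving the denominator toward 1."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    m, e = math.frexp(abs(d))
    f = _seed(m)
    num, den = n * f, m * f          # den starts within 1/17 of 1.0
    for _ in range(iterations):
        f = 2.0 - den
        num *= f                     # these two multiplies are independent,
        den *= f                     # so hardware can overlap them
    q = math.ldexp(num, -e)
    return -q if d < 0 else q

if __name__ == "__main__":
    for a, b in [(1.0, 3.0), (-7.5, 2.5), (22.0, -7.0)]:
        print(a, b, newton_raphson_divide(a, b), goldschmidt_divide(a, b), a / b)
```

In both routines every step is a multiplication or a subtraction from 2, which is why such implementations can reuse the floating-point multiplier, and also why divide and multiply then compete for the same pipeline. Because the error roughly squares on each pass, a more accurate starting estimate removes iterations at the cost of a larger table.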
