Split-Path Fused Floating Point Multiply Accumulate (FPMAC)

Floating point multiply-accumulate (FPMAC) unitis the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. FPMAC unit is used extensively in contemporary client microprocessors, further proliferated with ISA support for instructions like AVX and SSE and also extensively used in server processors employed for engineering and scientific applications. Consequently design of FPMAC is of vital consideration since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design which focuses on optimal computations in the critical path and therefore making it the fastest FPMAC design as of today in literature. The design is based on the premise of isolating and optimizing the critical path computation in FPMAC operation. In this work we have three key innovations to create a novel double precision FPMAC with least ever gate stages in the timing critical path: a) Splitting near and far paths based on the exponent difference (d=Exy-Ez = {-2, -1, 0, 1} is near path and the rest is far path), b) Early injection of the accumulate add for near path into the Wallace tree for eliminating a 3:2compressor from near path critical logic, exploiting the small alignment shifts in near path and sparse Wallace tree for 53 bit mantissa multiplication, c) Combined round and accumulate add for eliminating the completion adder from multiplier giving both timing and power benefits. Our design by premise of splitting consumes lesser power for each operation where only the required logic for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate the support for all rounding modes to adhere to IEEE standard for double precision FPMAC which is critical for employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation of IBM Power6 [6] by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to best known timing design from Lang et al. [5] and outperforms it by 7% while being 30% smaller in area than it.

[1]  Ieee Circuits,et al.  Digest of technical papers , 1984 .

[2]  Gregory B. Zyner,et al.  167 MHz radix-4 floating point multiplier , 1995, Proceedings of the 12th Symposium on Computer Arithmetic.

[3]  Asim J. Al-Khalili,et al.  A Low Power Approach to Floating Point Adder Design for DSP Applications , 1997, Proceedings International Conference on Computer Design VLSI in Computers and Processors.

[4]  Javier D. Bruguera,et al.  A novel design of a two operand normalization circuit , 1998, IEEE Trans. Very Large Scale Integr. Syst..

[5]  N. Burgess Flagged prefix adder for dual additions , 1998, Optics & Photonics.

[6]  Cheng-Chew Lim,et al.  Reduced latency IEEE floating-point standard adder architectures , 1999, Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336).

[7]  K.J. Nowka,et al.  1 GHz leading zero anticipator using independent sign-bit determination logic , 2000, 2000 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.00CH37103).

[8]  Alan F. Murray,et al.  IEEE International Solid-State Circuits Conference , 2001 .

[9]  Asim J. Al-Khalili,et al.  A Low Power Approach to Floating Point Adder Design for DSP Applications , 2001, J. VLSI Signal Process..

[10]  Chichyang Chen,et al.  Architectural design of a fast floating-point multiplication-add fused unit using signed-digit addition , 2001, Proceedings Euromicro Symposium on Digital Systems Design.

[11]  T. Lang,et al.  Floating-point fused multiply-add with reduced latency , 2002, Proceedings. IEEE International Conference on Computer Design: VLSI in Computers and Processors.

[12]  W. Belluomini,et al.  A double precision floating point multiply , 2003, 2003 IEEE International Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC..

[13]  P.-M. Seidel,et al.  On-line IEEE floating-point multiplication and division for reduced power dissipation , 2004, Conference Record of the Thirty-Eighth Asilomar Conference on Signals, Systems and Computers, 2004..

[14]  Peter-Michael Seidel,et al.  Delay-optimized implementation of IEEE floating-point addition , 2004, IEEE Transactions on Computers.

[15]  Javier D. Bruguera,et al.  Floating-point multiply-add-fused with reduced latency , 2004, IEEE Transactions on Computers.

[16]  E.E. Swartzlander,et al.  Floating-Point Fused Multiply-Add Architectures , 2007, 2007 Conference Record of the Forty-First Asilomar Conference on Signals, Systems and Computers.

[17]  Eric M. Schwarz,et al.  P6 Binary Floating-Point Unit , 2007, 18th IEEE Symposium on Computer Arithmetic (ARITH '07).