Optimistic Parallelization of Floating-Point Accumulation

Floating-point arithmetic is notoriously non-associative: its limited-precision representation demands that intermediate values be rounded to fit in the available precision, so the result of a sum depends on the order in which the additions are performed. The resulting cyclic dependency in floating-point accumulation (each partial sum depends on the rounded value of the previous one) inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation onto a network of 16 statically scheduled, pipelined, double-precision floating-point adders on a Virtex-4 LX160 (-12) device, where each floating-point adder runs at 296 MHz with a pipeline depth of 10. On this 16-PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks.
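
To make the scheme concrete, here is a minimal software sketch of the relax-until-correct idea. It is a sequential Python model, not the paper's statically scheduled adder network: it captures the exact rounding error of each addition using Knuth's TwoSum error-free transformation (one standard way to recover the error term; the paper's hardware forwards error terms inside the adder network), then folds the collected errors back in and re-sums until none remain. The function names here are illustrative, not taken from the paper.

```python
def two_sum(a, b):
    """Knuth's TwoSum: returns (s, e) with s = fl(a + b) and
    a + b == s + e exactly, so e is the rounding error of the add."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

def relaxed_sum(values):
    """Optimistic summation with iterative error propagation (a
    software model of the paper's relaxation scheme).

    Each pass sums the current terms in an arbitrary, and hence
    parallelizable, order while collecting every nonzero rounding
    error; the errors are then appended as new terms and the pass
    repeats.  The loop terminates when a pass is error-free, at
    which point the result no longer depends on association order.
    """
    terms = list(values)
    while True:
        s = 0.0
        errors = []
        for x in terms:          # in hardware, this pass is pipelined
            s, e = two_sum(s, x)
            if e != 0.0:
                errors.append(e)
        if not errors:           # no rounding error left: fixed point
            return s
        terms = [s] + errors     # relax: propagate errors and re-sum
```

For instance, relaxed_sum([1e16, 1.0, -1e16]) returns 1.0, whereas a single left-to-right pass over the same values yields 0.0 because the 1.0 is rounded away when added to 1e16.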
