FPGA implementation of an exact dot product and its application in variable-precision floating-point arithmetic

This paper explores the capability and flexibility of field-programmable gate arrays (FPGAs) for implementing variable-precision floating-point (VP) arithmetic. First, the VP exact dot product algorithm, which accumulates products with exact fixed-point operations to obtain an exact result, is presented. A VP multiply-and-accumulate (VPMAC) unit on FPGA is then proposed. In the proposed design, parallel multipliers concurrently generate the partial products of the mantissa multiplication, which is the most time-consuming part of a VP multiply-accumulate operation. This approach fully exploits the DSP blocks on FPGAs to enhance the performance of the VPMAC unit. Several other schemes, such as a two-level RAM bank, carry-save accumulation, and partial summation, are used to achieve high frequency and pipeline throughput in the product-accumulation stage. Typical algorithms from the Basic Linear Algebra Subprograms (vector dot product, general matrix-vector product, and general matrix-matrix product), together with LU decomposition and Modified Gram–Schmidt QR decomposition, are used to evaluate the performance of the VPMAC unit. Two schemes, a VPMAC coprocessor and a matrix accelerator, are presented to implement these applications. Finally, prototypes of the VPMAC unit and of the matrix accelerator built from it are implemented on a Xilinx XC6VLX760 FPGA. Compared with a parallel software implementation based on OpenMP running on a quad-core Intel Xeon E5620 CPU, the VPMAC coprocessor, equipped with one VPMAC unit, achieves a maximum speedup of 18X. Moreover, the matrix accelerator, which mainly consists of a linear array of eight processing elements, achieves 12X to 65X better performance.
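The core idea of the exact dot product is to place every exact significand-by-significand product into a fixed-point accumulator wide enough to cover the entire exponent range, so that only a single rounding occurs when the final sum is converted back to floating point. The paper realizes this in hardware; the following is only a minimal software sketch of the same Kulisch-style long-accumulator idea, where the constant FRAC_BITS and the use of Python's arbitrary-width integers are illustrative assumptions, not part of the proposed design.

```python
import math

# Assumed width of the fractional part of the long accumulator, chosen with
# generous margin so that the smallest possible product of two IEEE-754
# doubles still lands on an integer bit position.
FRAC_BITS = 2300


def exact_dot(xs, ys):
    """Exact dot product of two float sequences via a long fixed-point
    accumulator, rounded only once at the very end."""
    acc = 0  # Python int stands in for the wide fixed-point register
    for x, y in zip(xs, ys):
        mx, ex = math.frexp(x)          # x = mx * 2**ex, 0.5 <= |mx| < 1
        my, ey = math.frexp(y)
        ix = int(mx * (1 << 53))        # exact 53-bit integer significand
        iy = int(my * (1 << 53))
        shift = ex + ey - 106 + FRAC_BITS
        acc += (ix * iy) << shift       # exact product added to the accumulator
    # Single rounding step: correctly rounded integer division back to a float.
    return acc / (1 << FRAC_BITS)


# Catastrophic cancellation is handled exactly: naive left-to-right summation
# of these products loses the 1.0, while the exact dot product preserves it.
print(exact_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]))   # 1.0
```

In hardware, the same accumulator is split across RAM banks and updated with carry-save adders, which is what the two-level RAM bank and carry-save accumulation schemes in the VPMAC unit address.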
