An Efficient Method of Parallel Multiplication on a Single DSP Slice for Embedded FPGAs

Field-programmable gate arrays (FPGAs) can efficiently implement custom applications through their embedded digital signal processor (DSP) slices, which include binary multipliers. The more binary multipliers a DSP slice contains, the more multiplication operations it can process in one clock cycle. To fully utilize DSP resources, this paper proposes a novel DSP slice optimization method, named PMSDS, that achieves parallel multiplication on a single DSP slice. First, PMSDS splits the multiplication operands (multiplicators) into two separate parts, valid bits and vacant bits, using a customized polynomial algebra method. Then, PMSDS pre-calculates the maximum number of overflow bits using the same polynomial algebra method. Finally, it computes the total bit width of the operands and packs the final operands so that they are multiplied in parallel. We also propose an optimization model that finds the best parallel solution according to the performance and precision of a single DSP slice. Moreover, we implement a PMSDS-based matrix multiplication algorithm that supports dynamically changing computing precision. Experiments on large-scale, real-world matrix multiplication show that PMSDS performs better in latency and resource utilization than the traditional, add-tree, and full-unroll methods, and outperforms state-of-the-art methods in frequency and dynamic power consumption.
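The abstract does not spell out the PMSDS construction, so the following Python sketch only illustrates the general operand-packing idea that the three steps suggest; it is not the authors' algorithm. It assumes unsigned operands and a shared second operand `b`, and the names `parallel_multiply`, `packed_width`, `operand_bits`, and `b_bits` are hypothetical. Two small operands are placed in one wide word, separated by enough vacant bits that the low product cannot overflow into the high product, and a single multiplication then yields both results.

```python
def packed_width(operand_bits: int, b_bits: int) -> int:
    """Bit width of the packed operand: two fields plus the vacant guard bits."""
    return 2 * operand_bits + b_bits


def parallel_multiply(a_hi: int, a_lo: int, b: int,
                      operand_bits: int, b_bits: int) -> tuple[int, int]:
    """Return (a_hi * b, a_lo * b) using one wide multiplication.

    Assumption: all operands are unsigned; a_hi and a_lo fit in
    operand_bits, and b fits in b_bits.
    """
    for value, width in ((a_hi, operand_bits), (a_lo, operand_bits), (b, b_bits)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} unsigned bits")

    # a_lo * b needs at most operand_bits + b_bits bits, so shifting a_hi
    # by that amount keeps the two partial products in disjoint bit fields.
    shift = operand_bits + b_bits
    packed = (a_hi << shift) | a_lo

    wide = packed * b            # the single multiplication
    mask = (1 << shift) - 1
    return wide >> shift, wide & mask


if __name__ == "__main__":
    hi, lo = parallel_multiply(13, 7, 11, operand_bits=4, b_bits=4)
    print(hi, lo)                # 143 77
    print(packed_width(4, 4))    # 12
```

In hardware, the packed operand and `b` would have to fit the input ports of the target DSP slice, which is the kind of constraint the optimization model in the abstract presumably trades off against precision; signed operands would need additional correction that this sketch omits.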
