A Tag Based Random Order Vector Reduction Circuit

Vector reduction, which collapses a vector into a single scalar value, is a common operation in many scientific and engineering applications, so a fast and efficient vector reduction circuit is of great significance to real-time systems. A pipelined structure is usually adopted to increase the throughput of the reduction circuit and achieve maximum efficiency. In this paper, to handle multiple vectors of variable length arriving in a random input order, a novel tag-based, fully pipelined vector reduction circuit is first proposed, in which a cache state module is used to query and update the cache state of each vector. However, when the number of input vectors becomes large, a larger cache state module is required, which consumes more combinational logic and lowers the operating frequency. To solve this problem, a high-speed circuit is proposed in which the input vectors are divided into several groups and sent to dedicated cache state circuits, which improves the operating frequency. Compared with other existing work, the prototype circuit and the improved circuit built on it achieve the smallest Slices $\times$ µs product (<80% of the state-of-the-art) for different input vector lengths. Moreover, both circuits provide a simple and efficient interface whose access timing is similar to that of a RAM, so they can be applied to a wider range of systems.
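As a rough illustration of the tag-based bookkeeping described above, the following Python sketch models, at a purely behavioral level and not as the authors' RTL, how a per-tag cache state entry can track the partial sum and remaining element count of each vector while elements of several vectors arrive interleaved in random order. The names `CacheState` and `tag_based_reduce` are illustrative only, and the sketch deliberately ignores the deep pipelining of the floating-point adder that motivates the hardware design; it only shows the query-and-update role of the cache state.

```python
# Behavioral sketch (assumed, not the paper's circuit): elements of several
# vectors arrive interleaved, each carrying a tag identifying its vector.
# A per-tag "cache state" entry holds the running partial sum and the number
# of elements still expected for that vector.

from dataclasses import dataclass

@dataclass
class CacheState:
    partial_sum: float = 0.0
    remaining: int = 0  # elements still expected for this vector

def tag_based_reduce(stream, lengths):
    """stream: iterable of (tag, value) pairs in arbitrary order.
    lengths: dict mapping tag -> vector length.
    Returns dict mapping tag -> reduced scalar."""
    state = {tag: CacheState(remaining=n) for tag, n in lengths.items()}
    results = {}
    for tag, value in stream:
        entry = state[tag]           # query the cache state for this tag
        entry.partial_sum += value   # accumulate (adder stage)
        entry.remaining -= 1         # update the cache state
        if entry.remaining == 0:     # vector fully reduced: emit its scalar
            results[tag] = entry.partial_sum
    return results

# Example: two vectors (tags 0 and 1) of different lengths, interleaved.
if __name__ == "__main__":
    stream = [(0, 1.0), (1, 2.0), (0, 3.0), (1, 4.0), (1, 5.0)]
    print(tag_based_reduce(stream, {0: 2, 1: 3}))  # {0: 4.0, 1: 11.0}
```

In hardware, the same lookup-accumulate-update loop is what the cache state module performs per clock cycle; grouping vectors across several such modules, as in the improved circuit, shortens each module's combinational path and raises the operating frequency.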
