A Scalable Matrix Computing Unit Architecture for FPGA, and SCUMO User Design Interface

High-dimensional matrix algebra is essential in numerous signal processing and machine learning algorithms. This work describes a scalable square matrix computing unit based on circulant matrices. It optimizes the data flow for any sequence of matrix operations, removing the need to move intermediate results, and it optimizes the performance of each individual matrix operation in direct or transposed form (the transpose only requires a change in data addressing). The supported matrix operations are matrix-by-matrix addition, subtraction, dot product, and multiplication; matrix-by-vector multiplication; and matrix-by-scalar multiplication. The proposed architecture is fully scalable, with the maximum matrix dimension limited by the available resources. In addition, a design environment has been developed that assists the user, through a friendly interface, from the customization of the hardware computing unit to the generation of the final synthesizable IP core. For N × N matrices, the architecture requires N ALU-RAM blocks and runs in O(N²) time, requiring N² + 7 and N + 7 clock cycles for matrix-matrix and matrix-vector operations, respectively. On the tested Virtex-7 FPGA device, the implementation for 500 × 500 matrices reaches a maximum clock frequency of 346 MHz, achieving an overall performance of 173 GOPS. This architecture shows higher performance than other state-of-the-art matrix computing units.
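The cycle counts and throughput quoted above can be cross-checked with a short calculation. The Python sketch below is illustrative only: the function names are hypothetical, and the peak-throughput model assumes that each of the N ALU-RAM blocks completes one operation per clock cycle, which is an assumption made here because it reproduces the reported 173 GOPS figure, not a detail stated in the abstract.

```python
# Back-of-the-envelope model of the figures quoted in the abstract.
# Function names are hypothetical; they are not part of SCUMO.

PIPELINE_OVERHEAD_CYCLES = 7  # the constant "+ 7" term from the abstract


def cycles_matrix_matrix(n: int) -> int:
    """Clock cycles for an N x N matrix-matrix operation: N^2 + 7."""
    return n * n + PIPELINE_OVERHEAD_CYCLES


def cycles_matrix_vector(n: int) -> int:
    """Clock cycles for an N x N matrix-by-vector operation: N + 7."""
    return n + PIPELINE_OVERHEAD_CYCLES


def peak_gops(n: int, clock_mhz: float) -> float:
    """Peak throughput in GOPS.

    Assumption (not stated in the abstract): each of the N ALU-RAM
    blocks completes one operation per clock cycle.
    """
    return n * clock_mhz * 1e6 / 1e9


def latency_us(cycles: int, clock_mhz: float) -> float:
    """Latency in microseconds for a given cycle count and clock in MHz."""
    return cycles / clock_mhz


if __name__ == "__main__":
    n, f_mhz = 500, 346.0
    mm = cycles_matrix_matrix(n)
    mv = cycles_matrix_vector(n)
    print(f"matrix-matrix: {mm} cycles, {latency_us(mm, f_mhz):.2f} us")
    print(f"matrix-vector: {mv} cycles, {latency_us(mv, f_mhz):.2f} us")
    print(f"peak throughput: {peak_gops(n, f_mhz):.1f} GOPS")
```

Under that assumption, 500 ALU-RAM blocks clocked at 346 MHz give 500 × 346 × 10⁶ = 173 × 10⁹ operations per second, matching the reported 173 GOPS.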
