Implementation of Middle Product Algorithm on Linear Processor Arrays

This paper presents the design, implementation, and performance evaluation of a linear processor array accelerator for matrix multiplication, which we call the matrix multiplication processor (MMP). The MMP is composed of n processing elements (PEs) connected in a chain, distributed memory, and a dedicated address generator unit (AGU) that generates memory addresses. Because addresses are produced by this dedicated unit, address generation does not increase the processing time. The AGU is one major difference between the proposed architecture and graphics processing units (GPUs), which use ALUs to compute addresses. The MMP is based on FPGA technology, since these circuits offer a very high degree of parallelism and allow the RAM and data path architecture to be customized to the computation. We evaluate the proposed architecture in terms of the number of PEs, execution time, speedup, efficiency, and gain factor. We have implemented the AGU and PE in Xilinx Spartan 2E FPGAs using ISE 9.01 as the software tool. We compare our design with other solutions proposed in the literature with respect to execution time, number of PEs, AT measure, speedup, and efficiency.
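The organization described above can be illustrated with a small functional sketch in Python. It is not the implemented design and is not cycle-accurate: the row-major address order, the assignment of one column of B to each PE, and the names `agu_addresses`, `ProcessingElement`, and `multiply_on_chain` are assumptions made for illustration only. The metric helpers use the textbook definitions of speedup and efficiency, and take the number of PEs as a stand-in for area in the AT measure, which is also an assumption.

```python
from typing import Iterator, List, Tuple

Matrix = List[List[float]]


def agu_addresses(n: int) -> Iterator[Tuple[int, int]]:
    """Hypothetical AGU: yields (row, col) addresses of A in row-major order.

    Addresses come from this generator, not from PE arithmetic, mirroring
    the idea that address generation is kept off the compute path.
    """
    for i in range(n):
        for k in range(n):
            yield i, k


class ProcessingElement:
    """One PE in the chain, with local memory holding one column of B (assumed)."""

    def __init__(self, b_column: List[float]) -> None:
        self.b_column = b_column
        self.acc = [0.0] * len(b_column)  # partial sums for one column of C

    def step(self, i: int, k: int, a_value: float) -> None:
        # Multiply-accumulate: C[i][j] += A[i][k] * B[k][j] for this PE's j.
        self.acc[i] += a_value * self.b_column[k]


def multiply_on_chain(a: Matrix, b: Matrix) -> Matrix:
    """Functional model of n PEs in a chain computing C = A * B (square matrices)."""
    n = len(a)
    pes = [ProcessingElement([b[k][j] for k in range(n)]) for j in range(n)]
    for i, k in agu_addresses(n):
        a_value = a[i][k]
        for pe in pes:  # the operand is handed from PE to PE along the chain
            pe.step(i, k, a_value)
    return [[pes[j].acc[i] for j in range(n)] for i in range(n)]


def speedup(t_seq: float, t_par: float) -> float:
    """Sequential execution time divided by parallel execution time."""
    return t_seq / t_par


def efficiency(t_seq: float, t_par: float, num_pes: int) -> float:
    """Speedup per processing element."""
    return speedup(t_seq, t_par) / num_pes


def at_measure(num_pes: int, t_par: float) -> float:
    """Area-time product, using the PE count as a proxy for area (assumption)."""
    return num_pes * t_par
```

For example, `multiply_on_chain([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns the ordinary matrix product of the two inputs, and `efficiency(8, 2, 4)` evaluates to 1.0, the ideal case in which four PEs give a fourfold speedup.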
