Address generators for linear systolic array

Systolic arrays (SAs) are very efficient architectures for multimedia processing, database management, and scientific computing applications, all of which are characterized by a large number of data accesses. However, in these data-transfer- and storage-intensive applications, memory access is often the factor that limits computation speed. Since the memory subsystem dominates the cost (area), performance, and power consumption of the SA, special attention must be paid to how the memory subsystem can benefit from customization. In this paper we consider the memory organization of a linear systolic array with bi-directional links (called BLSA), which is suitable for implementing a broad class of algorithms. We assume that memory is organized into smaller, distributed physical memory modules. To provide high bandwidth in data access, we have designed special hardware called an address generator unit (AGU). The function of the AGU is threefold. First, during initialization, it transforms the host address space into the BLSA address space. Second, it provides efficient memory data access during BLSA operation. Third, it performs fast data transfer between the BLSA and the host at the end of the computation. In this article, we examine the impact on area and performance of memory-access-related circuitry that eliminates computationally intensive offset address calculations performed in software, by implementing the needed address transformations with the AGUs. By using hardware AGUs we achieved a speedup of approximately two compared to the software implementation of address calculation, with a hardware overhead of only 7.6% in the worst case.
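
To make the idea of replacing software offset calculations with dedicated address generation more concrete, the following is a minimal Python sketch. It is not the paper's AGU design: the column-interleaved memory layout, the function names (`host_to_blsa`, `software_addresses`, `agu_addresses`), and all parameters are assumptions chosen for illustration only. It shows one possible mapping from a row-major host address to a (memory module, local offset) pair, and contrasts a per-access multiply-and-add address calculation with an incremental generator that only adds a stride, which is the kind of work a hardware address generator typically takes over.

```python
from typing import Iterator, List, Tuple


def host_to_blsa(addr: int, n_cols: int, n_modules: int) -> Tuple[int, int]:
    """Map a row-major host address to (module, local offset).

    Hypothetical layout (an assumption, not taken from the paper):
    column c of the matrix is stored in module c % n_modules.
    """
    row, col = divmod(addr, n_cols)
    module = col % n_modules
    cols_per_module = -(-n_cols // n_modules)  # ceiling division
    local = row * cols_per_module + col // n_modules
    return module, local


def software_addresses(base: int, stride: int, count: int) -> List[int]:
    """Software-style calculation: one multiply and one add per access."""
    return [base + i * stride for i in range(count)]


def agu_addresses(base: int, stride: int, count: int) -> Iterator[int]:
    """Incremental generation: one add per access, as a simple
    counter-plus-adder circuit could do."""
    addr = base
    for _ in range(count):
        yield addr
        addr += stride


if __name__ == "__main__":
    # Walk down column 1 of a 4-column row-major matrix (stride = 4).
    print(list(agu_addresses(base=1, stride=4, count=3)))
    # Map host address 5 (row 1, column 1) onto 2 memory modules.
    print(host_to_blsa(5, n_cols=4, n_modules=2))
```

Both address sequences are identical; the difference is only in the work done per access, which is where a hardware generator can relieve the processing elements.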
