Linear Array For Efficient Execution Of Partitioned Matrix Algorithms

We propose a class-specific linear array suitable for partitioned execution of matrix algorithms. The array achieves high efficiency, exploits pipelining within cells in a simple manner, has an off-cell communication rate lower than its computation rate, and requires only a small amount of storage per cell, whose size is independent of the problem size. The array is well suited to the MMG method, a mapping technique based on data-dependency graphs. The MMG method can realize fixed-size data and partitioned problems as algorithm-specific arrays, and can map algorithms onto class-specific arrays. The array proposed here uses the mapping capabilities of the method, which combine coalescing and cut-and-pile as partitioning strategies. Mapping is illustrated using the LU-decomposition algorithm; results obtained from mapping other algorithms are also indicated. Performance estimates of the mappings show that, for example, LU-decomposition of a 2000 by 2000 matrix computed on a linear array with 100 cells, two operation units per cell in a 4-stage pipeline, and a 50 [nsec] clock period (i.e., a peak rate of 4000 [Mflops]) achieves 87% efficiency (3480 [Mflops]). This performance is obtained while requiring communication among cells of only 5 [Mwords/sec] and a peak external I/O bandwidth for the entire array of 5 [Mwords/sec] as well. Moreover, for a problem of this size, the use of cut-and-pile leads to storage requirements of only 8000 words per memory module.
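The peak and sustained rates quoted above follow from simple arithmetic, which the following minimal Python sketch reproduces. The function names are hypothetical and chosen only for illustration. The sketch assumes that each operation unit, once its pipeline is full, completes one floating-point operation per clock period, so the 4-stage pipeline depth does not enter the peak-rate formula.

```python
def peak_mflops(cells: int, units_per_cell: int, clock_ns: int) -> float:
    """Peak rate in Mflops, assuming one result per unit per clock period.

    A clock period of clock_ns nanoseconds corresponds to 1000 / clock_ns
    million cycles per second (assumption: fully pipelined operation units).
    """
    return cells * units_per_cell * 1000 / clock_ns


def sustained_mflops(peak: float, efficiency_percent: int) -> float:
    """Sustained rate in Mflops at the given efficiency (in percent)."""
    return peak * efficiency_percent / 100


# Figures taken from the abstract: 100 cells, two operation units per cell,
# 50 nsec clock period, 87% efficiency.
peak = peak_mflops(cells=100, units_per_cell=2, clock_ns=50)
sustained = sustained_mflops(peak, efficiency_percent=87)

print(f"peak rate:      {peak:.0f} Mflops")
print(f"sustained rate: {sustained:.0f} Mflops")
```

With the figures from the abstract, the sketch gives a peak of 4000 Mflops and a sustained rate of 3480 Mflops, matching the values stated in the text.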
