Systolic Algorithms for the CMU Warp Processor

CMU is building a 32-bit floating-point systolic array that can efficiently perform many essential signal-processing computations, such as the FFT and convolution. It is a one-dimensional systolic array that, in general, takes inputs from the cell at one end and produces outputs at the cell at the other end, with data and control all flowing in one direction. We call this particular systolic array the Warp processor, suggesting that it can perform various transformations at very high speed. We expect the Warp processor to have wide applications, especially the CMU prototype, which offers a high degree of flexibility at the expense of a relatively high chip count per cell. The prototype has 10 cells, each of which is capable of performing 10 million floating-point operations per second (10 MFLOPS) and is built on a single board using only off-the-shelf components. This 10-cell processor can, for example, compute a 1024-point complex FFT at a rate of one FFT every 600 µs. Under program control, the same processor can perform many other primitive computations in signal, image and vision processing, including two-dimensional convolution and complex matrix multiplication, at a rate of 100 MFLOPS. Together with another processor capable of performing divisions and square roots, the processor can also efficiently carry out a number of difficult matrix operations, such as solving covariance linear systems, a crucial computation in real-time adaptive signal processing. This paper outlines the architecture of the Warp processor and describes how the signal processing tasks are implemented on it.
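To make the idea of a one-dimensional array with unidirectional data flow concrete, the Python sketch below simulates, cycle by cycle, one possible linear systolic design for one-dimensional convolution, y_i = sum_j w_j x_{i-j}. It is an illustration under stated assumptions, not the Warp's actual cell program: the function name `systolic_convolve`, the weight-stationary layout (cell j permanently holds w_j), and the timing (partial sums advance one cell per cycle while samples advance one cell every two cycles through an extra delay latch) are all hypothetical choices made for this example.

```python
def systolic_convolve(weights: list[float], xs: list[float]) -> list[float]:
    """Cycle-level sketch of a linear systolic array computing the full
    convolution y_i = sum_j weights[j] * xs[i - j] (needs >= 1 weight).

    Hypothetical design (not the Warp's actual program): cell j keeps
    weights[j]; partial sums y advance one cell per cycle, while samples x
    advance one cell every two cycles via an extra delay latch per cell.
    Data enter only at cell 0 and results leave only at cell k - 1.
    """
    k, n = len(weights), len(xs)
    x_in = [0.0] * k     # sample presented to each cell this cycle
    x_delay = [0.0] * k  # extra latch that halves the speed of the x stream
    y_in = [0.0] * k     # partial sum presented to each cell this cycle
    outputs: list[float] = []

    # y_i leaves the last cell at cycle i + k - 1; the last output index is
    # n + k - 2, so n + 2k - 2 cycles drain the whole array.
    for t in range(n + 2 * k - 2):
        # Host drives the left end: next sample (zero-padded) and a fresh sum.
        x_in[0] = xs[t] if t < n else 0.0
        y_in[0] = 0.0

        # All cells fire in lockstep: one multiply-accumulate each.
        # Cell j sees x_{t-2j} and the partial sum for y_{t-j}.
        y_out = [y_in[j] + weights[j] * x_in[j] for j in range(k)]

        # The rightmost cell emits y_{t-(k-1)}; skip warm-up cycles.
        if t >= k - 1:
            outputs.append(y_out[-1])

        # Clock edge: every latch moves one step to the right at once.
        y_in = [0.0] + y_out[:-1]
        x_in, x_delay = [0.0] + x_delay[:-1], x_in[:]

    return outputs


if __name__ == "__main__":
    print(systolic_convolve([1.0, 2.0], [1.0, 1.0, 1.0]))  # [1.0, 3.0, 3.0, 2.0]
```

The host touches only cell 0 and reads only the last cell, so all data and results move in one direction through the array. Running samples at half the speed of the partial sums is what makes each partial sum for y_i meet x_{i-j} exactly as it passes cell j; other timing schemes are possible, and this one is chosen only because it keeps the flow strictly left to right.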