Memory requirements for balanced computer architectures

Abstract. In this paper, a processing element (PE) is characterized by its computation bandwidth, its I/O bandwidth, and the size of its local memory. In carrying out a computation, a PE is said to be balanced if its computing time equals its I/O time. Consider a PE that is balanced for some computation, and suppose that its computation bandwidth is increased by a factor of α relative to its I/O bandwidth. When carrying out the same computation, the PE will then be imbalanced; i.e., it will have to wait for I/O. A standard method of avoiding this I/O bottleneck is to reduce the overall I/O requirement of the PE by increasing the size of its local memory. This paper addresses the question of how much the PE's local memory must be enlarged in order to restore balance. The following results are shown. For matrix computations such as matrix multiplication and Gaussian elimination, the size of the local memory must be increased by a factor of α^2. For computations such as relaxation on a k-dimensional grid, the local memory must be enlarged by a factor of α^k. For some other computations, such as the FFT and sorting, the increase is exponential; i.e., the size of the new memory must be the size of the original memory raised to the αth power. All these results indicate that, to keep a PE balanced, the size of its local memory must be increased much more rapidly than its computation bandwidth. This phenomenon seems to be common for many computations in which an output may depend on a large subset of the inputs. Implications of these results for some parallel computer architectures are also discussed. One particular result is that, to balance an array of p linearly connected PEs for performing matrix computations such as matrix multiplication and matrix triangularization, the size of each PE's local memory must grow linearly with p. Thus, the larger the array is, the larger each PE's local memory must be.
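As a rough illustration of the three scaling laws stated in the abstract, the sketch below computes the local memory size needed to restore balance after the computation bandwidth grows by a factor of α relative to the I/O bandwidth. The function name `required_memory`, the `kind` labels, and the choice of units are assumptions made for this example; they do not come from the paper, and constant factors are ignored.

```python
def required_memory(kind: str, old_memory: float, alpha: float, k: int = 2) -> float:
    """Return the local memory size needed to rebalance a PE.

    kind       -- hypothetical label for the class of computation:
                  "matrix"   (matrix multiplication, Gaussian elimination)
                  "grid"     (relaxation on a k-dimensional grid)
                  "fft_sort" (FFT, sorting)
    old_memory -- memory size (in words) at which the PE was balanced
    alpha      -- factor by which computation bandwidth grew relative to I/O
    k          -- grid dimension, used only when kind == "grid"
    """
    if old_memory <= 0 or alpha < 1:
        raise ValueError("expected old_memory > 0 and alpha >= 1")
    if kind == "matrix":
        # Memory must grow by a factor of alpha^2.
        return alpha ** 2 * old_memory
    if kind == "grid":
        # Memory must grow by a factor of alpha^k.
        return alpha ** k * old_memory
    if kind == "fft_sort":
        # New memory is the old memory raised to the alpha-th power.
        return old_memory ** alpha
    raise ValueError(f"unknown computation kind: {kind!r}")


if __name__ == "__main__":
    # Doubling compute bandwidth (alpha = 2) for a PE balanced at 1024 words.
    for kind in ("matrix", "grid", "fft_sort"):
        print(kind, required_memory(kind, 1024, 2, k=3))
```

With α = 2 and 1024 words of original memory, the matrix case needs 4 times the memory, the 3-dimensional grid case needs 8 times, and the FFT or sorting case needs 1024^2 words, which shows how quickly the exponential case outpaces the polynomial ones.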
