Networks and algorithms for very-large-scale parallel computation

The continuing rapid progress of VLSI technology is beginning to make possible the construction of very-large-scale parallel computing assemblages. In such systems, tens or even hundreds of thousands of arithmetic devices cooperate to solve certain problems quickly. Parallel computers of this type have been contemplated for many years,1-3 and fairly large parallel machines such as Illiac IV and the ICL DAP have been made operational. Recently, however, technological progress has lent new interest to this area. It has begun to attract the attention of increasing numbers of university and industrial researchers, who have chosen various lines of attack.

One significant approach, pioneered by Kung, focuses on the great economic and speed advantage that can be gained by designing algorithms that conform well to the restrictions imposed by VLSI technology,7 in particular algorithms and parallel system architectures that lay out well in two dimensions. Studies along these lines aim at the design of powerful special-purpose chips and of systems small enough to reside on a single chip.

A second, more conventional approach is represented by the work reported in this article. This approach entails the use of high-performance but otherwise standard microprocessor chips tightly coupled via a suitable network. Central assumptions of this approach are that single-chip processors will be able to execute instructions at a 20-megacycle rate and that megabit memory chips will be available in quantity by the end of the present decade. The possibility of using modified versions of presently existing programming languages to program large parallel machines is an important feature of this work.

A third line of research emphasizes architectures derived from very general abstract data flow models of parallel computation.8,9 This work has stressed the possible advantages of a purely applicative, side-effect-free programming language for the description of parallel computation.10

These three approaches lead to machines suited for different environments. Kung's systolic arrays should be most useful for such well-defined, fixed tasks as the kernels of certain signal processing applications. These arrays might be hard to adapt when the algorithms change or when many different cases must be considered. Although data flow machines have been discussed for several years, no optimal architecture has yet emerged. Later in this article, we show how a data flow language can be executed with maximum parallelism on the more conventional parallel machines described here. A crucial part of the design of any highly parallel …