The VLIW Machine: A Multiprocessor for Compiling Scientific Code

After 30 years of sequential programming, can we realistically expect to gain speedup with parallel architectures by developing parallel code? With very long instruction word machines, we may not have to.

Most of us agree that highly parallel hardware is the natural solution for speeding up computation. Unfortunately, while we are confident about building large parallel machines, we must question our ability to produce parallel, general-purpose scientific code. Many feel that this inability stems from 30 years of programming sequential architectures, calling it the "von Neumann mindset." The feeling is that once there are enough parallel architectures around and enough bright programmers to use them, a parallel programming methodology will develop naturally. Many parallel architectures are being developed today, and this hypothesis will at least get a meaningful test. If it is proved true, so much the better. It may also be proved false, since it may not be humanly possible to write large, highly parallel, general-purpose scientific code. The arguments for this supposition go something like this:

(1) So far, we've been largely unsuccessful in asking programmers to express entire programs in parallel.

(2) The architecture cannot require a close match between the algorithms and the hardware. When such a match occurs, it can certainly yield a great speedup, but such matches, and the great speedups that come with them, are bound to be infrequent. Given a fixed structure in the hardware, it is unlikely to match the fixed structure of many algorithms. Some problems, such as many found in signal processing, do match vector machines well, for example. General-purpose code, however, rarely matches so well.

(3) The whole problem must run in parallel. Most scientific programs are built of simple inner loops and the highly irregular code that does the setting up, the postprocessing, and the special cases. It is unusual when more than 80 percent of the running time is spent in inner loops. Even if parallelism reduces the running time of the inner loops to zero, you can speed up the code only by a factor of five unless you find parallelism throughout the rest of the problem (the arithmetic is worked out below).

These objections apply to both multiprocessors and vector machines. With multiprocessors, a programmer must identify large sections of code that are relatively data independent. With vector machines, a programmer must identify large operations that work on vectors as an aggregate; the sketch at the end of this section illustrates the distinction. Even if a compiler can sometimes produce vectorized code, as some claim,¹ vector machines speed up only …
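To make the factor-of-five claim in argument (3) concrete, here is the standard Amdahl's-law arithmetic; the formulation and the symbols $p$ and $s$ are mine, not the article's. Let $p$ be the fraction of running time spent in inner loops and $s$ the speedup applied to them:

$$
\text{speedup} = \frac{1}{(1 - p) + p/s}, \qquad p = 0.8,\ s \to \infty \ \Longrightarrow\ \text{speedup} = \frac{1}{1 - 0.8} = 5.
$$

Even an infinitely fast inner loop leaves the remaining 20 percent of the work running at its old speed, which caps the overall speedup at five.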
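To illustrate the distinction between aggregate vector operations and irregular setup code, here is a minimal C sketch; the functions and names are hypothetical, invented for illustration, and not taken from the article. The first loop is the kind of aggregate operation a vector machine handles well; the second resists vectorization because of a loop-carried dependence and a data-dependent special case.

```c
#include <stddef.h>

/* An aggregate vector operation: y[i] = a*x[i] + y[i] for all i.
   Each iteration is independent, so a vector machine (or a
   vectorizing compiler) can execute it as one long vector operation. */
void saxpy(size_t n, float a, const float x[], float y[])
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Irregular "setup" code: a loop-carried dependence (each step needs
   the previous result) plus a data-dependent branch. Neither maps
   onto a fixed vector structure, so this part of the program sees
   little or no speedup on a vector machine. */
float settle(size_t n, float v[])
{
    float acc = 0.0f;
    for (size_t i = 1; i < n; i++) {
        v[i] = v[i] + 0.5f * v[i - 1];   /* recurrence on v[i-1] */
        if (v[i] < 0.0f)                 /* special case */
            v[i] = 0.0f;
        acc += v[i];
    }
    return acc;
}
```

Under the Amdahl arithmetic above, speeding up saxpy alone, however dramatically, leaves settle running at its old pace.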