Software pipelining: an effective scheduling technique for VLIW machines

This paper shows that software pipelining is an effective and viable scheduling technique for VLIW processors. In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete. The advantage of software pipelining is that optimal performance can be achieved with compact object code. This paper extends previous results on software pipelining in two ways. First, it shows that by using an improved algorithm, near-optimal performance can be obtained without specialized hardware. Second, it proposes a hierarchical reduction scheme whereby entire control constructs are reduced to an object similar to an operation in a basic block. With this scheme, all innermost loops, including those containing conditional statements, can be software pipelined. The scheme also diminishes the start-up cost of loops with a small number of iterations. Hierarchical reduction complements the software pipelining technique, permitting a consistent performance improvement to be obtained. The proposed techniques have been validated by an implementation of a compiler for Warp, a systolic array consisting of 10 VLIW processors. This compiler has been used to develop a large number of applications in the areas of image, signal and scientific processing.
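To make the idea of initiating iterations at constant intervals concrete, the following minimal sketch compares the cycle count of a loop run one iteration at a time with the same loop when a new iteration starts every `interval` cycles. The function names and the numbers (5 iterations, a 4-cycle body, an interval of 1) are illustrative assumptions, not values from the paper, and the sketch ignores the resource and dependence constraints that determine the achievable interval in practice.

```python
def iteration_windows(num_iterations, body_length, interval):
    """Return (start, end) cycles when iteration i is initiated at i * interval.

    Hypothetical helper: body_length is the number of cycles one iteration
    takes, interval is the constant distance between iteration starts.
    """
    return [(i * interval, i * interval + body_length)
            for i in range(num_iterations)]


def total_cycles(num_iterations, body_length, interval):
    """Cycles until the last iteration finishes under a constant interval."""
    if num_iterations == 0:
        return 0
    return (num_iterations - 1) * interval + body_length


# Illustrative values only (assumptions, not taken from the paper).
n, body, ii = 5, 4, 1

# Without overlap, the next iteration starts only when the previous one ends,
# which is the same as using an interval equal to the body length.
sequential = total_cycles(n, body, body)
pipelined = total_cycles(n, body, ii)

print("windows:", iteration_windows(n, body, ii))
print("sequential cycles:", sequential)
print("pipelined cycles:", pipelined)
```

With a smaller interval the iterations overlap, so the total approaches `num_iterations * interval` as the loop gets longer. The fixed `body_length` term is the start-up cost that the abstract says matters for loops with a small number of iterations.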
