Highly concurrent scalar processing
Edward S. Davidson et al., ISCA 1986

High-speed scalar processing is an essential characteristic of high-performance general-purpose computer systems. Highly concurrent execution of scalar code is difficult because of data dependencies and conditional branches. This paper proposes an architectural concept called guarded instructions to reduce the penalty of conditional branches in deeply pipelined processors. A code-generation heuristic, the decision tree scheduling technique, reorders instructions in a complex of basic blocks so as to make efficient use of guarded instructions. Performance evaluations of several benchmarks are presented, including a module from the UNIX kernel. Even with these difficult scalar code examples, a speedup of two is achievable with conventional pipelined uniprocessors augmented by guarded instructions, and a speedup of three or more can be achieved with processors that have parallel instruction pipelines.
