TTAs: Missing the ILP complexity wall

A common approach to enhancing processor performance is to increase the number of function units that operate concurrently. We observe this development in all recent general-purpose superscalar processors and in VLIW (very long instruction word) processors used in more dedicated application domains, such as multimedia. This paper analyzes the data path complexity of ILP (instruction-level parallelism) processors, in particular VLIWs, and shows that they may soon hit the complexity wall: their complexity gets out of control when scaling to very high performance. Several methods for reducing this complexity are investigated. Essentially, these methods trade hardware complexity for software complexity, i.e., they perform as much work as possible at compile time. Combining these methods results in a new architecture, called the transport triggered architecture (TTA). The concept of transport triggering is outlined together with its characteristics. It is shown that applying this concept yields a number of hardware advantages and enables a number of new scheduling optimizations. Together these substantially reduce the ILP complexity bottleneck, as demonstrated by a number of experiments.
