Simultaneous multithreaded vector architecture: merging ILP and DLP for high performance

We show that instruction-level parallelism (ILP) and data-level parallelism (DLP) can be merged in a single simultaneous multithreaded vector architecture to execute regular vectorizable code at a performance level that neither paradigm can achieve on its own. The combination of the two techniques yields very high performance at low cost and low complexity. On regular numerical codes, the architecture achieves a sustained performance 20 times that of today's superscalar microprocessors. Moreover, it tolerates very large memory latencies, of up to 100 cycles, with relatively small performance degradation. This high performance is independent of working set size and of locality considerations, since the DLP paradigm allows very efficient exploitation of the bandwidth of a high-performance flat memory system.
