High-performance throughput computing

Chip multithreading (CMT) processors offer a way to significantly improve the performance of computer systems. The return on investment for multithreading is among the highest of any microarchitectural technique. If a core is designed from scratch to support multithreading, throughput gains as high as 3× are possible for just a 20 percent increase in area. Even with throughput performance as the main target, we have shown that the microarchitecture necessary to support threads on a CMT can also achieve high single-thread performance. Hardware scouting, which Sun is implementing on the Rock microprocessor, can increase the single-thread performance of applications by up to 40 percent. Scouting can also be viewed as a technique that makes the on-chip caches appear much larger, or as a performance-robustness technique that makes up for code tailored to different on-chip cache sizes or even a different number and levels of caches.
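To make the return-on-investment claim concrete, the quoted figures can be turned into a throughput-per-area ratio. The following is a minimal sketch, not a model from the paper: the function name `throughput_per_area` and the simple ratio metric are assumptions introduced here for illustration, and the inputs are the best-case numbers stated above (up to 3× throughput for about 20 percent more area).

```python
def throughput_per_area(speedup: float, area_increase: float) -> float:
    """Throughput gain per unit of core area, relative to a baseline core.

    speedup       -- throughput of the multithreaded core / baseline throughput
    area_increase -- fractional area growth (0.20 means 20 percent larger)
    """
    return speedup / (1.0 + area_increase)


# Best-case figures quoted in the text (hypothetical use of them as inputs).
gain = throughput_per_area(speedup=3.0, area_increase=0.20)
print(f"throughput per unit area: {gain:.2f}x the baseline")
```

Under these assumptions the multithreaded core delivers 2.5 times the throughput per unit of area of the baseline, which is the sense in which the return on investment is high.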
