Paper information - A performance-correctness explicitly-decoupled architecture

A performance-correctness explicitly-decoupled architecture

Optimizing the common case has been an adage in decades of processor design practices. However, as the system complexity and optimization techniques' sophistication have increased substantially, maintaining correctness under all situations, however unlikely, is contributing to the necessity of extra conservatism in all layers of the system design. The mounting process, voltage, and temperature variation concerns further add to the conservatism in setting operating parameters. Excessive conservatism in turn hurts performance and efficiency in the common case. However, much of the system's complexity comes from advanced performance features and may not compromise the whole system's functionality and correctness even if some components are imperfect and introduce occasional errors. We propose to separate performance goals from the correctness goal using an explicitly-decoupled architecture. In this paper, we discuss one such incarnation where an independent core serves as an optimistic performance enhancement engine that helps accelerate the correctness-guaranteeing core by passing high-quality predictions and performing accurate prefetching. The lack of concern for correctness in the optimistic core allows us to optimize its execution in a more effective fashion than possible in optimizing a monolithic core with correctness requirements. We show that such a decoupled design allows significant optimization benefits and is much less sensitive to conservatism applied in the correctness domain.
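The decoupling idea in the abstract can be illustrated with a toy model. The sketch below is not the paper's design: the function names, the FIFO of branch hints, and the fixed 10-cycle penalty are all assumptions made for illustration, and prefetching is not modeled. It only shows the property the abstract argues for: errors in the optimistic core cost time in the correctness core, never correctness.

```python
import random
from collections import deque


def make_branch_outcomes(n_branches, seed):
    """Ground-truth branch outcomes of a hypothetical program."""
    rng = random.Random(seed)
    return [rng.random() < 0.5 for _ in range(n_branches)]


def optimistic_core(outcomes, error_rate, seed):
    """Run ahead and emit one branch hint per branch.

    The optimistic core is allowed to be wrong: each hint is flipped
    with probability `error_rate` (an assumed, illustrative knob).
    """
    rng = random.Random(seed)
    return deque((not o) if rng.random() < error_rate else o for o in outcomes)


def correctness_core(outcomes, hints, mispredict_penalty=10):
    """Consume hints as predictions, but always verify and commit the
    real outcome. A bad hint only adds `mispredict_penalty` cycles
    (an assumed cost), so the committed result is never affected.
    """
    cycles = 0
    committed = []
    for actual in outcomes:
        hint = hints.popleft()
        cycles += 1
        if hint != actual:
            cycles += mispredict_penalty
        committed.append(actual)
    return committed, cycles


if __name__ == "__main__":
    truth = make_branch_outcomes(1000, seed=1)
    for rate in (0.0, 0.05, 0.5):
        hints = optimistic_core(truth, error_rate=rate, seed=2)
        committed, cycles = correctness_core(truth, hints)
        print(f"error_rate={rate}: correct={committed == truth}, cycles={cycles}")
```

In this model the committed results match the ground truth at every error rate, while the cycle count grows with the number of bad hints. That is the sense in which the optimistic side can be tuned aggressively without touching the correctness guarantee.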

Michael C. Huang | Alok Garg
