Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery

Dual-core execution (DCE) is an execution paradigm that uses chip multiprocessors to improve the performance of single-threaded applications. Previous research has shown that DCE provides a complexity-effective way to build a highly scalable instruction window and achieves significant latency-hiding capabilities. In DCE, a program is first processed speculatively by the front processor and then reexecuted by the back processor. This reexecution is the key to eliminating the centralized structures normally associated with very large instruction windows. In this paper, we exploit the computational redundancy in DCE to improve its reliability and its power efficiency. The main contributions are: 1) DCE-based redundancy checking for transient-fault tolerance, together with a complexity-effective approach to achieving full redundancy coverage, and 2) novel techniques to improve the power/energy efficiency of DCE-based execution paradigms. Our experimental results demonstrate that, with the proposed simple techniques, the optimized DCE can achieve transient-fault tolerance or significant performance enhancement in a power/energy-efficient way. Compared to the original DCE, the optimized DCE achieves similar speedups (34 percent on average) over single-core processors while reducing the energy overhead from 93 percent to 31 percent.
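
To put the closing figures in perspective, the short sketch below combines the reported speedup and energy overhead into an energy-delay product relative to a single-core baseline. This metric is not reported in the abstract; it is an illustrative calculation. It assumes that a 34 percent speedup means execution time of 1/1.34 of the baseline, and that an energy overhead of 93 (or 31) percent means total energy of 1.93 (or 1.31) times the baseline. The function name `relative_energy_delay` is hypothetical.

```python
def relative_energy_delay(speedup_pct, energy_overhead_pct):
    """Energy-delay product relative to a single-core baseline (baseline = 1.0).

    Assumption: speedup_pct and energy_overhead_pct are both measured
    against the same single-core baseline.
    """
    relative_time = 1.0 / (1.0 + speedup_pct / 100.0)      # shorter is better
    relative_energy = 1.0 + energy_overhead_pct / 100.0    # lower is better
    return relative_energy * relative_time


# Figures taken from the abstract: 34% average speedup in both cases,
# energy overhead of 93% (original DCE) versus 31% (optimized DCE).
original = relative_energy_delay(34, 93)
optimized = relative_energy_delay(34, 31)

print(f"original DCE:  {original:.2f}x baseline energy-delay")
print(f"optimized DCE: {optimized:.2f}x baseline energy-delay")
```

Under these assumptions, the original DCE comes out at roughly 1.44 times the baseline energy-delay product and the optimized DCE at roughly 0.98 times, that is, about on par with the single-core baseline.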
