Exploring hardware transaction processing for reliable computing in chip-multiprocessors against soft errors

As transistor feature sizes shrink and nodal capacitance and supply voltage drop with each new technology generation, microprocessors are becoming more vulnerable to single-event upsets and transients, also known as soft errors. Chip-multiprocessor (CMP) architectures are now standard in mainstream microprocessors and the number of on-chip processor cores keeps increasing, yet the system-level reliability of a chip-multiprocessor degrades in inverse proportion to its core count. In this work, we propose to exploit the abundant on-chip processor cores for redundant hardware transaction processing, which provides native support for soft-error detection and recovery in transactional chip-multiprocessors (TxCMPs). The proposed transactional processor cores execute all code as transactions, and a TxCMP executes redundant copies of each transaction on different cores. To reduce the performance overhead of transaction commits, we further propose two architectural optimizations for reliable computing mode: early partial commit packet transmission and speculative transaction execution. Our experimental evaluation confirms that the optimized TxCMPs achieve low-cost reliable computing against soft errors.
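The redundant-transaction idea can be illustrated with a small software sketch. This is not the paper's hardware design: it only models the general pattern of running the same transaction on two cores, comparing their write sets at commit time, and re-executing on a mismatch. All names here (`run_on_core`, `redundant_commit`, the `faults` parameter, the `transfer` transaction) are hypothetical, and the injected bit flip stands in for a soft error.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

State = Dict[str, int]
# A transaction reads a private snapshot and returns its write set.
Transaction = Callable[[State], Dict[str, int]]
Fault = Tuple[str, int]  # (key whose value is corrupted, bit position to flip)


def run_on_core(txn: Transaction, state: State,
                fault: Optional[Fault] = None) -> Dict[str, int]:
    """Execute txn on a private copy of state and return its write set.

    If fault is given, one bit of one written value is flipped to
    imitate a soft error on this (simulated) core.
    """
    write_set = dict(txn(dict(state)))
    if fault is not None:
        key, bit = fault
        write_set[key] ^= 1 << bit
    return write_set


def redundant_commit(txn: Transaction, state: State,
                     faults: Iterable[Fault] = (),
                     max_retries: int = 3) -> int:
    """Run txn redundantly and commit only when both write sets agree.

    Returns the number of re-executions that were needed. Each entry in
    faults corrupts the first copy of one successive attempt
    (an assumption made purely for illustration).
    """
    pending = list(faults)
    for attempt in range(max_retries + 1):
        fault = pending.pop(0) if pending else None
        first = run_on_core(txn, state, fault)
        second = run_on_core(txn, state)
        if first == second:
            state.update(first)  # commit: write sets match
            return attempt
        # Mismatch: both write sets are discarded. The shared state was
        # never modified, so recovery is simply re-execution.
    raise RuntimeError("transaction could not commit within retry limit")


def transfer(s: State) -> Dict[str, int]:
    """Example transaction: move 10 units from account a to account b."""
    return {"a": s["a"] - 10, "b": s["b"] + 10}


if __name__ == "__main__":
    accounts = {"a": 100, "b": 0}
    retries = redundant_commit(transfer, accounts, faults=[("a", 3)])
    print(retries, accounts)
```

Because each copy buffers its writes until the comparison succeeds, a detected mismatch needs no separate checkpoint: the uncommitted write sets are dropped and the transaction runs again. The sketch does not model the two commit optimizations named in the abstract.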
