Soft-error mitigation by means of decoupled transactional memory threads

CMOS scaling exacerbates hardware errors making reliability a big concern for recent and future microarchitecture designs. Mechanisms to provide fault tolerance in architectures must accomplish several objectives such as low performance degradation, power consumption and area overhead. Several studies have already proposed fault tolerance for parallel codes. However, these proposals are usually implemented over non-realistic environments including the use of shared-buses among processors or modifying highly optimized hardware designs such as caches. Our attempt to face this multiple challenge is an architectural design called LBRA (Log-Based Redundant Architecture). Based on a Hardware Transactional Memory architecture, LBRA executes redundant threads which communicate through a pair-shared virtual memory log allocated in cache. Our initial version of LBRA executes these redundant threads in SMT cores. To avoid the performance penalty inherent to this architecture, we propose to decouple their execution in different cores, solving the inter-core communication by means of a log buffer empowered by a simple prefetch strategy. Simulation results using a variety of scientific and multimedia applications show that the execution time overhead of our best design is less than 7 % over a base case without fault tolerance. Additionally, we show that LBRA outperforms previous proposals that we have implemented and evaluated in the same framework.

[1]  David A. Wood,et al.  LogTM-SE: Decoupling Hardware Transactional Memory from Caches , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[2]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[3]  Jim Gray,et al.  Fault Tolerance in Tandem Computer Systems , 1987 .

[4]  Yuan Taur,et al.  CMOS design near the limit of scaling , 2002 .

[5]  Norman P. Jouppi,et al.  Readings in computer architecture , 2000 .

[6]  Saurabh Dighe,et al.  An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[7]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[8]  Dean M. Tullsen,et al.  Interconnections in Multi-Core Architectures: Understanding Mechanisms, Overheads and Scaling , 2005, ISCA 2005.

[9]  Yen-Kuang Chen,et al.  The energy efficiency of CMP vs. SMT for multimedia workloads , 2004, ICS '04.

[10]  Shekhar Y. Borkar,et al.  Design challenges of technology scaling , 1999, IEEE Micro.

[11]  Keith A. Bowman,et al.  Circuit techniques for dynamic variation tolerance , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[12]  David J. Frank,et al.  Power-constrained CMOS scaling limits , 2002, IBM J. Res. Dev..

[13]  Engin Ipek,et al.  Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[14]  Dean M. Tullsen,et al.  Interconnections in multi-core architectures: understanding mechanisms, overheads and scaling , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[15]  Paul Feautrier,et al.  A New Solution to Coherence Problems in Multicache Systems , 1978, IEEE Transactions on Computers.

[16]  José M. García,et al.  A log-based redundant architecture for reliable parallel computation , 2010, 2010 International Conference on High Performance Computing.

[17]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[18]  Babak Falsafi,et al.  Fingerprinting: bounding soft-error-detection latency and bandwidth , 2004, IEEE Micro.

[19]  Michael C. Huang,et al.  Supporting highly-decoupled thread-level redundancy for parallel programs , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[20]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[21]  Josep Torrellas,et al.  BulkSC: bulk enforcement of sequential consistency , 2007, ISCA '07.

[22]  Hugh Garraway Parallel Computer Architecture: A Hardware/Software Approach , 1999, IEEE Concurrency.

[23]  Yen-Kuang Chen,et al.  The ALPBench benchmark suite for complex multimedia applications , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[24]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[25]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[26]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[27]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[28]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[29]  C.H. Kim,et al.  A 0.2 V, 480 kb Subthreshold SRAM With 1 k Cells Per Bitline for Ultra-Low-Voltage Computing , 2008, IEEE Journal of Solid-State Circuits.

[30]  David García,et al.  NonStop/spl reg/ advanced architecture , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[31]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[32]  Richard E. Grandy,et al.  Orlando, Florida, USA , 2011 .

[33]  R. Engelbrecht,et al.  DIGEST of TECHNICAL PAPERS , 1959 .

[34]  José M. García,et al.  REPAS: Reliable Execution for Parallel ApplicationS in Tiled-CMPs , 2009, Euro-Par.

[35]  Mateo Valero Cortés,et al.  FaulTM: Fault-Tolerance Using Hardware Transactional Memory , 2010 .

[36]  Sigarch Proceedings of the 27th International Symposium on Computer Architecture, June 12-14, 2000, Vancouver, British Columbia, Canada , 2000 .

[37]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[38]  Henry Hoffmann,et al.  The Raw Microprocessor: A Computational Fabric for Software Circuits and General-Purpose Programs , 2002, IEEE Micro.

[39]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[40]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[41]  Lisa Spainhower,et al.  IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective , 1999, IBM J. Res. Dev..

[42]  Josep Torrellas,et al.  Rebound: Scalable checkpointing for coherent shared memory , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).