Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margins

The power envelope has become a major issue for the design of computer systems. One way of reducing energy consumption is to downscale the voltage of microprocessors. However, this does not come without costs. By decreasing the voltage, the likelihood of failures increases drastically and without mechanisms for reliability, the systems would not operate any more. For reliability we need (1) error detection and (2) error recovery mechanisms. We provide in this paper a first study investigating the combination of different error detection mechanisms with transactional memory, with the objective to improve energy efficiency. According to our evaluation, using reliability schemes combined with transactional memory for error recovery reduces energy by 54% while providing a reliability level of 100%.

[1]  Josep Torrellas,et al.  Rebound: Scalable checkpointing for coherent shared memory , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[2]  Ronald G. Dreslinski,et al.  The M5 Simulator: Modeling Networked Systems , 2006, IEEE Micro.

[3]  Dan Grossman,et al.  EnerJ: approximate data types for safe and general low-power computation , 2011, PLDI '11.

[4]  L. Alvisi,et al.  A Survey of Rollback-Recovery Protocols , 2002 .

[5]  Franck Petit,et al.  Stabilization, Safety, and Security of Distributed Systems , 2016, Lecture Notes in Computer Science.

[6]  Martin Müller,et al.  Software protection mechanisms for dependable systems , 2008 .

[7]  Christof Fetzer,et al.  Software-Implemented Hardware Error Detection: Costs and Gains , 2010, 2010 Third International Conference on Dependability.

[8]  Sanjay J. Patel,et al.  ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2006, IEEE Trans. Dependable Secur. Comput..

[9]  James R. Larus,et al.  Transactional Memory , 2006, Transactional Memory.

[10]  Sarita V. Adve,et al.  Using likely program invariants to detect hardware errors , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[11]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[12]  Ying Zhang,et al.  Fault recovery based on checkpointing for hard real-time embedded systems , 2003, Proceedings 18th IEEE Symposium on Defect and Fault Tolerance in VLSI Systems.

[13]  Nancy G. Leveson,et al.  The Use of Self Checks and Voting in Software Error Detection: An Empirical Study , 1990, IEEE Trans. Software Eng..

[14]  Mateo Valero,et al.  SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[15]  Christof Fetzer,et al.  Transactional Encoding for Tolerating Transient Hardware Errors , 2013, SSS.

[16]  David Blaauw,et al.  Near-Threshold Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits , 2010, Proceedings of the IEEE.

[17]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[18]  James R. Larus,et al.  Transactional Memory, 2nd edition , 2010, Transactional Memory.

[19]  David I. August,et al.  Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.

[20]  Shankar Balachandran,et al.  The Implications of Shared Data Synchronization Techniques on Multi-Core Energy Efficiency , 2012, HotPower.

[21]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[22]  Doe Hyun Yoon,et al.  Memory mapped ECC: low-cost error protection for last level caches , 2009, ISCA '09.

[23]  Edward J. McCluskey,et al.  Concurrent Error Detection Using Watchdog Processors - A Survey , 1988, IEEE Trans. Computers.

[24]  Christof Fetzer,et al.  Transactional memory for dependable embedded systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).

[25]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[26]  Onur Mutlu,et al.  Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[27]  Mateo Valero Cortés,et al.  FaulTM: Fault-Tolerance Using Hardware Transactional Memory , 2010 .

[28]  Maurice Herlihy,et al.  Energy reduction in multiprocessor systems using transactional memory , 2005, ISLPED '05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005..

[29]  Anne-Marie Kermarrec,et al.  A recoverable distributed shared memory integrating coherence and recoverability , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[30]  Radu Teodorescu,et al.  Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors , 2013, ISCA.

[31]  K. Roy,et al.  A 160 mV Robust Schmitt Trigger Based Subthreshold SRAM , 2007, IEEE Journal of Solid-State Circuits.

[32]  Christof Fetzer,et al.  ANB- and ANBDmem-Encoding: Detecting Hardware Errors in Software , 2010, SAFECOMP.

[33]  Osman S. Unsal,et al.  Fault tolerance for multi-threaded applications by leveraging hardware transactional memory , 2013, CF '13.

[34]  Laura L. Pullum,et al.  Software Fault Tolerance Techniques and Implementation , 2001 .

[35]  Rana Ejaz Ahmed,et al.  Cache-aided rollback error recovery (CARER) algorithm for shared-memory multiprocessor systems , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[36]  William G. Griswold,et al.  Dynamically discovering likely program invariants to support program evolution , 1999, Proceedings of the 1999 International Conference on Software Engineering (IEEE Cat. No.99CB37002).

[37]  P. Forin,et al.  VITAL CODED MICROPROCESSOR PRINCIPLES AND APPLICATION FOR VARIOUS TRANSIT SYSTEMS , 1990 .