Transactional Memory for Reliability

Technology trends are expected to increase the transient and permanent fault rates of future processors. Providing reliability for applications running on both personal computers and mission-critical systems is therefore becoming a necessity. A reliable system requires two key capabilities: 1) error detection and 2) error recovery. Transactional Memory (TM) provides an ideal base for both. First, TM provides mechanisms to abort a transaction in case of a conflict: the aborted transaction discards or undoes all of its tentative memory updates and restarts execution from the beginning of the transaction. A transaction’s start can therefore be viewed as a locally checkpointed stable state that can be used for error recovery. Second, transactional semantics allow error detection to be deferred until a transaction commits (or until a value becomes externally visible). This reduces the cost of error detection compared with traditional schemes, in which detection is performed at every instruction [26], and can make detection more efficient.
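The two properties described above (a transaction's start acting as a checkpoint, and error detection deferred to commit time) can be illustrated with a minimal software sketch. This is only an analogy under stated assumptions, not the hardware mechanism of any cited work: a Python dict stands in for memory, the transaction body is simply executed twice on private copies, and the names `run_reliable_transaction`, `TransientFault`, and `make_flaky_increment` are hypothetical helpers introduced for illustration.

```python
import copy


class TransientFault(Exception):
    """Raised when redundant executions keep disagreeing at commit time."""


def run_reliable_transaction(memory, body, max_retries=3):
    """Run `body` redundantly and commit only if both executions agree.

    `memory` is a dict standing in for shared memory (an assumption of this
    sketch). `body` is a function that mutates the dict it receives.
    """
    for _ in range(max_retries + 1):
        # Transaction start: `memory` itself is left untouched, so it serves
        # as the checkpoint. All updates go to private, tentative copies.
        primary = copy.deepcopy(memory)
        shadow = copy.deepcopy(memory)
        body(primary)
        body(shadow)

        # Deferred error detection: compare the results once, at commit,
        # instead of checking after every single operation.
        if primary == shadow:
            memory.clear()
            memory.update(primary)  # commit the tentative updates
            return memory
        # Mismatch: abort. The tentative copies are discarded and execution
        # restarts from the checkpoint on the next loop iteration.
    raise TransientFault("redundant executions disagreed on every attempt")


def make_flaky_increment(fault_on_call):
    """Build a body that increments x and simulates one bit flip.

    The fault is injected only on the call numbered `fault_on_call`,
    mimicking a transient error (hypothetical fault model).
    """
    calls = {"n": 0}

    def body(mem):
        calls["n"] += 1
        mem["x"] += 1
        if calls["n"] == fault_on_call:
            mem["x"] ^= 0x4  # simulated single-bit flip

    return body


if __name__ == "__main__":
    mem = {"x": 1}
    run_reliable_transaction(mem, make_flaky_increment(fault_on_call=1))
    print(mem)
```

In this sketch the first attempt is aborted because the injected bit flip makes the two executions disagree; the retry starts again from the unmodified `memory` and commits. The comparison happens once per transaction, which is the cost saving the abstract attributes to deferring detection until commit.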

[1]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[2]  Annette Bieniusa,et al.  Consistency in hindsight: A fully decentralized STM algorithm , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).

[3]  Torvald Riegel,et al.  Composable Error Recovery With Transactional Memory , 2009, Bull. EATCS.

[4]  José M. García,et al.  Soft-error mitigation by means of decoupled transactional memory threads , 2015, Distributed Computing.

[5]  Mateo Valero,et al.  EazyHTM: EAger-LaZY hardware Transactional Memory , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[6]  Luís E. T. Rodrigues,et al.  D2STM: Dependable Distributed Software Transactional Memory , 2009, 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing.

[7]  Shubu Mukherjee,et al.  Architecture Design for Soft Errors , 2008.

[8]  Karthik Pattabiraman,et al.  Towards understanding the effects of intermittent hardware faults on programs , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).

[9]  Luís E. T. Rodrigues,et al.  Cloud-TM: harnessing the cloud with distributed transactional memories , 2010, OPSR.

[10]  Osman S. Unsal,et al.  Fault tolerance for multi-threaded applications by leveraging hardware transactional memory , 2013, CF '13.

[11]  Luís E. T. Rodrigues,et al.  A Generic Framework for Replicated Software Transactional Memories , 2011, 2011 IEEE 10th International Symposium on Network Computing and Applications.

[12]  Shubhendu S. Mukherjee,et al.  Detailed design and evaluation of redundant multithreading alternatives , 2002, ISCA.

[13]  Josep Torrellas,et al.  Rebound: Scalable checkpointing for coherent shared memory , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[14]  Koushik Chakraborty,et al.  Adapting to intermittent faults in multicore systems , 2008, ASPLOS.

[15]  Christian Scheideler,et al.  Stabilization, Safety, and Security of Distributed Systems , 2012, Lecture Notes in Computer Science.

[16]  Mikel Luján,et al.  DiSTM: A Software Transactional Memory Framework for Clusters , 2008, 2008 37th International Conference on Parallel Processing.

[17]  Mateo Valero,et al.  SymptomTM: Symptom-Based Error Detection and Recovery Using Hardware Transactional Memory , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[18]  Christopher J. Hughes,et al.  Performance evaluation of Intel® Transactional Synchronization Extensions for high-performance computing , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[19]  Nematollah Bidokhti.  SEU concept to reality (allocation, prediction, mitigation) , 2010, 2010 Proceedings - Annual Reliability and Maintainability Symposium (RAMS).

[20]  N. Hengartner,et al.  Predicting the number of fatal soft errors in Los Alamos national laboratory's ASC Q supercomputer , 2005, IEEE Transactions on Device and Materials Reliability.

[21]  Christof Fetzer,et al.  Transactional memory for dependable embedded systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops (DSN-W).

[22]  David A. Wood,et al.  LogTM: log-based transactional memory , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006.

[23]  Bradford L. Chamberlain,et al.  Software transactional memory for large scale clusters , 2008, PPoPP.

[24]  Osman S. Unsal,et al.  FaulTM: Error detection and recovery using Hardware Transactional Memory , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[25]  Christof Fetzer,et al.  Transactional Encoding for Tolerating Transient Hardware Errors , 2013, SSS.

[26]  Mateo Valero Cortés,et al.  FaulTM: Fault-Tolerance Using Hardware Transactional Memory , 2010.

[27]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.

[28]  Zhiying Wang,et al.  Transient Fault Recovery on Chip Multiprocessor based on Dual Core Redundancy and Context Saving , 2008, 2008 The 9th International Conference for Young Computer Scientists.

[29]  Elena Tsanko,et al.  Verification of transactional memory in POWER8 , 2014, 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC).

[30]  Kewal K. Saluja,et al.  Built-in self-testing of random-access memories , 1990, Computer.

[31]  Christof Fetzer,et al.  Transactional Memory for Dependable Embedded Systems (Poster) , 2011, Hot Topics in System Dependability.

[32]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[33]  Robert Baumann,et al.  Soft errors in advanced computer systems , 2005, IEEE Design & Test of Computers.

[34]  Binoy Ravindran,et al.  On Closed Nesting and Checkpointing in Fault-Tolerant Distributed Transactional Memory , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[35]  Kunle Olukotun,et al.  Transactional memory coherence and consistency , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.