Fault Recovery Based on Parallel Recomputing in Transactional Memory System

This paper addresses fault recovery in transactional memory systems and proposes a recovery method based on parallel recomputing. The method uses the data-versioning mechanism of the transactional memory system to avoid the extra cost of saving state, rolls back only the faulty transaction so that the computing time of fault-free transactions is not wasted, and adopts parallel recomputing to reduce the cost of fault recovery. The paper applies the method to OpenTM programs and describes how parallel recomputing can be implemented in OpenTM. Finally, the performance of the method is evaluated with a test program. The experimental results show that, compared with fault recovery that only rolls back a single transaction, parallel recomputing in a transactional memory system performs fault recovery quickly and accurately, and that the method scales well.
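The three ideas in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's OpenTM implementation. It assumes a lazy (write-buffer) form of data versioning, which the abstract does not specify. It also assumes that a failed transaction's iterations are independent, so they can be split evenly across workers. All names (`work`, `run_transaction`, `parallel_recompute`, `demo`) are hypothetical, and the fault is injected by a flag rather than detected.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_commit_lock = threading.Lock()  # serializes commits in this simulation


def work(i):
    """Hypothetical per-iteration computation inside a transaction."""
    return i * i


def run_transaction(memory, indices, inject_fault=False):
    """Run one simulated transaction with a private write buffer.

    Writes are buffered in `write_set` and reach `memory` only on commit,
    so an abort needs no separate checkpoint: dropping the buffer is the
    rollback. Returns True on commit, False on abort.
    """
    write_set = {}
    for i in indices:
        write_set[i] = work(i)
    if inject_fault:
        # Simulated fault noticed before commit: discard the buffer.
        return False
    with _commit_lock:
        memory.update(write_set)  # commit
    return True


def parallel_recompute(memory, indices, workers):
    """Redo a failed transaction's work as several smaller transactions."""
    # Assumption: iterations are independent, so a strided split is valid.
    chunks = [indices[k::workers] for k in range(workers)]
    chunks = [c for c in chunks if c]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: run_transaction(memory, c), chunks))
    return all(results)


def demo():
    memory = {}
    tx_a = list(range(0, 8))
    tx_b = list(range(8, 16))
    run_transaction(memory, tx_a)  # fault-free: its result is kept
    if not run_transaction(memory, tx_b, inject_fault=True):
        # Only the faulty transaction is redone, spread over four workers.
        parallel_recompute(memory, tx_b, workers=4)
    return memory
```

Calling `demo()` leaves the result of the fault-free transaction untouched and fills in the faulty transaction's entries through the recomputation path. The sketch shows structure only; Python threads do not give a meaningful picture of the speedup reported in the experiments.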
