Data storage optimization of application-level checkpointing on heterogeneous systems

The emergence of general-purpose GPUs (GPGPUs) has made heterogeneous computing practical and has reshaped general-purpose and parallel computing on GPUs. Heterogeneous systems have since been adopted in large-scale high-performance computers. Fault tolerance is essential for scientific computing at this scale, yet in the few years since GPGPUs and heterogeneous systems appeared, no effective fault tolerance method has emerged for them. To address this, this paper applies a traditional fault tolerance technique, application-level checkpointing, to heterogeneous systems. Because the main way to reduce the overhead of application-level checkpointing is to reduce the size of the checkpoint data, we analyze heterogeneous systems and GPGPU programs, propose a method to optimize the data storage of application-level checkpointing, and validate the optimization experimentally.
