Programmer-directed partial redundancy for resilient HPC

In this work we propose partial task replication and checkpointing for task-parallel HPC applications to mitigate silent data corruption (SDC) errors. As the complete replication of all application tasks can be prohibitive due to resource costs, we introduce programmer-directed selective replication mechanism to provide fault-tolerance while decreasing costs. Results show that our scheme detects and corrects around 65% of SDC errors with only 4% overhead on average.

[1]  Barbara M. Chapman,et al.  A Prototype Implementation of OpenMP Task Dependency Support , 2013, IWOMP.

[2]  Alejandro Duran,et al.  Support for OpenMP tasks in Nanos v4 , 2007, CASCON.

[3]  Franck Cappello,et al.  Toward Exascale Resilience , 2009, Int. J. High Perform. Comput. Appl..

[4]  Alejandro Duran,et al.  Ompss: a Proposal for Programming Heterogeneous Multi-Core Architectures , 2011, Parallel Process. Lett..

[5]  John Shalf,et al.  The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..