Poster: Programming Model Extensions for Resilience in Extreme Scale Computing

System resilience is a key challenge in building extreme-scale systems. Many HPC applications are inherently resilient, but application programmers lack mechanisms to convey their fault tolerance knowledge to the system. We present a cross-layer approach to resilience: we propose a set of programming model extensions and develop a runtime inference framework that reasons about the context and significance of faults, as they occur, relative to the application programmer's fault tolerance expectations. Using a set of accelerated fault injection experiments on real scientific and engineering codes, we demonstrate the validity of our approach. Our experiments show that a cross-layer approach, which explicitly engages the programmer in expressing fault tolerance knowledge that is then leveraged across the layers of system abstraction, can significantly improve the dependability of long-running HPC applications.