Exploring the interplay of resilience and energy consumption for a task-based partial differential equations preconditioner

Abstract: We discuss algorithm-based resilience to silent data corruptions (SDCs) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm reformulates the PDE as a sampling problem, then updates the solution through data manipulation that is resilient to SDCs. The implementation follows a server-client model in which the servers hold all state information and the clients act solely as computational units. Scalability tests on up to ∼51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single and double bit-flips to demonstrate the resilience of the application to synthetically injected SDCs. We discuss two fault scenarios: one in which all data of a target task are corrupted, and one in which a single data point is corrupted. We show that, for our application and the test problem considered, a four-fold increase in the number of faults raises the overhead needed to overcome them by only two percentage points, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
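The fault model described above (random single and double bit-flips, applied either to all data of a task or to a single data point) can be illustrated with a minimal sketch. This is not the authors' injection code: the function names `flip_bits`, `corrupt_single_point`, and `corrupt_all_points`, the assumption that task data are a flat list of 64-bit IEEE 754 doubles, and the uniform choice of bit positions are all assumptions made for illustration.

```python
import random
import struct


def flip_bits(value, positions):
    """Return `value` with the given bit positions flipped.

    Bit 0 is the least significant mantissa bit and bit 63 the sign bit
    of the 64-bit IEEE 754 representation.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    for p in positions:
        bits ^= 1 << p
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted


def corrupt_single_point(data, n_flips=1, rng=random):
    """Hypothetical scenario: corrupt one randomly chosen entry of `data`."""
    out = list(data)
    idx = rng.randrange(len(out))
    # n_flips = 1 gives a single bit-flip, n_flips = 2 a double bit-flip.
    out[idx] = flip_bits(out[idx], rng.sample(range(64), n_flips))
    return out


def corrupt_all_points(data, n_flips=1, rng=random):
    """Hypothetical scenario: corrupt every entry of a task's data."""
    return [flip_bits(x, rng.sample(range(64), n_flips)) for x in data]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    task_data = [1.0, 2.5, -3.75, 0.125]
    print(corrupt_single_point(task_data, n_flips=2, rng=rng))
    print(corrupt_all_points(task_data, n_flips=1, rng=rng))
```

Depending on which bit is hit, a flip can be negligible (low mantissa bits), change the sign, or produce an infinity or NaN (exponent bits), which is why a resilient update has to tolerate corruptions of very different magnitude.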
