Fault mitigation strategies for CUDA GPUs

High computational throughput is a key requirement in many applications, and Graphics Processing Units (GPUs) are increasingly adopted to meet it. Their low cost and high parallelism make GPUs attractive even in safety-critical applications. Nonetheless, new methodologies must be studied and developed to increase the dependability of GPUs. This paper presents effective strategies for mitigating permanent faults in CUDA-based GPUs. The methodology for applying these strategies to the software to be executed is fully described and verified. With the proposed technique, performance degrades gracefully and still exceeds that of a multithreaded CPU implementation, even in the presence of multiple permanent faults.
