An improved fault mitigation strategy for CUDA Fermi GPUs

High computational performance is a key requirement in many applications, and Graphics Processing Units (GPUs) are increasingly adopted to meet it. Low prices and high parallelism make GPUs attractive even in safety-critical applications. Nonetheless, new methodologies must be studied and developed to increase the dependability of GPUs. This paper presents an improved fault mitigation strategy against permanent faults for CUDA Fermi GPUs. The proposed approach exploits the reverse engineering of the block scheduling policy in CUDA Fermi GPUs in order to minimize the timing overhead of fault mitigation. The graceful performance degradation achieved by the proposed technique outperforms multithreaded CPU implementations and other fault mitigation strategies for CUDA GPUs, even in the presence of multiple permanent faults.
