A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCs

Today, GPUs are widely used as coprocessors and accelerators in heterogeneous high-performance computing because of their many advantages. However, many studies emphasize that GPUs are not yet as reliable as desired. Although GPUs are more vulnerable to hardware errors than CPUs, their use in HPCs continues to grow. Moreover, because of the inherent reliability problems of GPUs, combining a large number of GPUs with CPUs can significantly increase the failure rates of HPCs. Analyzing the reliability characteristics of GPU-based HPCs has therefore become an important issue, and in this study we evaluate their reliability. We first examined field data analysis studies of GPU-based and CPU-based HPCs and identified factors that could increase system failure and error rates. We then used these factors to compare GPU-based HPCs with CPU-based HPCs in terms of reliability, in order to point out the reliability challenges of GPU-based HPCs. Our primary goal is to guide researchers in this field by describing the current state of GPU-based heterogeneous HPCs and their future requirements in terms of reliability. Our second goal is to offer a methodology for comparing the reliability of GPU-based and CPU-based HPCs. To the best of our knowledge, this is the first survey to compare the reliability of GPU-based and CPU-based HPCs in a systematic manner.
