Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors in an actual system actively used by multiple concurrent users running workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors recorded across 18,688 compute nodes of Titan for 312,215 user jobs over a three-year period. By carefully joining two system logs, the Machine Check Architecture (MCA) log and the job scheduler log, we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also underscores the importance of error correction in memory systems.
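As a concrete illustration of the log-joining step described above, the sketch below shows one way an MCA error log could be correlated with a job scheduler log: match each error to the job that occupied the same node during the error's timestamp, then aggregate per job and per user. This is a minimal sketch, not the paper's actual pipeline; the file names and column names (mca_log.csv, node_id, job_id, and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical schemas (the paper does not publish exact field names):
# mca_log:   one row per hardware error, with the reporting node and
#            the timestamp of the machine-check event.
# sched_log: one row per (job_id, node_id) pair, i.e., the job's node
#            allocation expanded, plus job start/end times and user_id.
mca_log = pd.read_csv("mca_log.csv", parse_dates=["timestamp"])
sched_log = pd.read_csv("sched_log.csv", parse_dates=["start_time", "end_time"])

# Join on node, then keep only errors that fall inside the job's
# execution window on that node.
joined = mca_log.merge(sched_log, on="node_id", how="inner")
joined = joined[
    (joined["timestamp"] >= joined["start_time"])
    & (joined["timestamp"] <= joined["end_time"])
]

# Per-job and per-user error counts, analogous to the correlated
# error patterns the study reports.
errors_per_job = joined.groupby("job_id").size()
errors_per_user = joined.groupby("user_id").size()
```

At the scale reported here (millions of errors, hundreds of thousands of jobs), a naive node-wise merge can inflate memory use, so in practice one would sort both logs by node and time and sweep through matching intervals per node instead of materializing the full join.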
