Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer
[1] Luigi Carro, et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[2] Saurabh Gupta, et al. A Multi-faceted Approach to Job Placement for Improved Performance on Extreme-Scale Systems, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2006, IEEE Transactions on Dependable and Secure Computing.
[4] Feiyi Wang, et al. Using Balanced Data Placement to Address I/O Contention in Production Environments, 2016, 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD).
[5] Bin Nie, et al. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System, 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[6] Mark F. Adams, et al. Gyrokinetic particle simulation of neoclassical transport in the pedestal/scrape-off region of a tokamak plasma, 2006.
[7] Scott Klasky, et al. Workflow automation for processing plasma fusion simulation data, 2007, WORKS '07.
[8] Bin Nie, et al. A large-scale study of soft-errors on GPUs in the field, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[9] Christian Engelmann, et al. A Big Data Analytics Framework for HPC Log Data: Three Case Studies Using the Titan Supercomputer Log, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[10] Nathan DeBardeleben, et al. Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Vijay S. Pande, et al. Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU, 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[12] Arie Shoshani, et al. Hello ADIOS: the challenges and lessons of developing leadership class I/O frameworks, 2014, Concurr. Comput. Pract. Exp.
[13] Karsten Schwan, et al. Plasma fusion code coupling using scalable I/O services and scientific workflows, 2009, WORKS '09.
[14] Christian Engelmann, et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[15] Scott Atchley, et al. GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[16] Raghul Gunasekaran, et al. Scientific User Behavior and Data-Sharing Trends in a Petascale File System, 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[17] Esteban Meneses, et al. Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer, 2015.
[18] Alok N. Choudhary, et al. A flexible I/O arbitration framework for netCDF‐based big data processing workflows on high‐end supercomputers, 2017, Concurr. Comput. Pract. Exp.