A Methodology for Comparing the Reliability of GPU-Based and CPU-Based HPCs
[1] Sarita V. Adve,et al. Efficient GPU synchronization without scopes: Saying no to complex consistency models , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[2] Stephen W. Keckler,et al. SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[3] Franck Cappello,et al. Low-overhead diskless checkpoint for hybrid computing systems , 2010, 2010 International Conference on High Performance Computing.
[4] Franck Cappello,et al. LOGAIDER: A Tool for Mining Potential Correlations of HPC Log Events , 2017, 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID).
[5] Rajeev Thakur,et al. A Meta-Learning Failure Predictor for Blue Gene/L Systems , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).
[6] Jingling Xue,et al. PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs , 2012, Journal of Computer Science and Technology.
[7] Christian Engelmann,et al. Big Data Meets HPC Log Analytics: Scalable Approach to Understanding Systems at Extreme Scale , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[8] Paolo Rech,et al. Analyzing the criticality of transient faults-induced SDCs on GPU applications , 2017, ScalA@SC.
[9] Saurabh Gupta,et al. Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[10] Christian Engelmann,et al. Failures in Large Scale Systems: Long-term Measurement, Analysis, and Implications , 2017, SC17: International Conference for High Performance Computing, Networking, Storage and Analysis.
[11] Bo Fang,et al. Poster: Evaluating Error Resiliency of GPGPU Applications , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[12] Pei Li,et al. A Failure Prediction-Based Adaptive Checkpointing Method with Less Reliance on Temperature Monitoring for HPC Applications , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[13] Stephen L. Scott,et al. Reliability of a System of k Nodes for High Performance Computing Applications , 2010, IEEE Transactions on Reliability.
[14] Joel S. Emer,et al. The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.
[15] Qiang Guan,et al. Lifetime memory reliability data from the field , 2017, 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT).
[16] Franck Cappello,et al. Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities , 2009, Int. J. High Perform. Comput. Appl..
[17] Bianca Schroeder,et al. Reading between the lines of failure logs: Understanding how HPC systems fail , 2013, 2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[18] Elmira Yu. Kalimulina,et al. Analysis of system reliability with control, dependent failures, and arbitrary repair times , 2015, Int. J. Syst. Assur. Eng. Manag..
[19] Ada Gavrilovska,et al. HeteroCheckpoint: Efficient Checkpointing for Accelerator-Based Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[20] Nathan DeBardeleben,et al. Extra Bits on SRAM and DRAM Errors - More Data from the Field. , 2014 .
[21] Rajeev Thakur,et al. A study of dynamic meta-learning for failure prediction in large-scale systems , 2010, J. Parallel Distributed Comput..
[22] Franck Cappello,et al. Modeling and tolerating heterogeneous failures in large parallel systems , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[23] Bo Fang,et al. GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[24] Paolo Rech,et al. Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience , 2017, 2017 IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS).
[25] Ravishankar K. Iyer,et al. Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo , 2019, ArXiv.
[26] Arun K. Somani,et al. Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs , 2010, 2010 IEEE International Conference on Electro/Information Technology.
[27] David W. Nellans,et al. Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[28] Xin Fu,et al. Analyzing soft-error vulnerability on GPGPU microarchitecture , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).
[29] Sarita V. Adve,et al. Relyzer: Application Resiliency Analyzer for Transient Faults , 2013, IEEE Micro.
[30] Mikko H. Lipasti,et al. Precision-aware soft error protection for GPUs , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[31] Onur Mutlu,et al. Memory scaling: A systems architecture perspective , 2013, 2013 5th IEEE International Memory Workshop.
[32] Franck Cappello,et al. Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..
[33] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[34] Matthias Weber,et al. Automatic Analysis of Large Data Sets: A Walk-Through on Methods from Different Perspectives , 2013, 2013 International Conference on Cloud Computing and Big Data.
[35] Charng-Da Lu. Failure Data Analysis of HPC Systems , 2013, ArXiv.
[36] Satoshi Matsuoka,et al. A high-performance fault-tolerant software framework for memory on commodity GPUs , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[37] Wei Xu,et al. What Can We Learn from Four Years of Data Center Hardware Failures? , 2017, 2017 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[38] Ravishankar K. Iyer,et al. Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[39] John Shalf,et al. The International Exascale Software Project roadmap , 2011, Int. J. High Perform. Comput. Appl..
[40] Bo Fang,et al. Towards Building Error Resilient GPGPU Applications , 2012 .
[41] Swann Perarnau,et al. Monitoring strategies for scalable dynamic checkpointing , 2016, 2016 Seventh International Green and Sustainable Computing Conference (IGSC).
[42] Stefano Di Carlo,et al. Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs , 2018, 2018 IEEE 36th VLSI Test Symposium (VTS).
[43] Saurabh Gupta,et al. Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[44] John Shalf,et al. Exascale Computing Technology Challenges , 2010, VECPAR.
[45] Franck Cappello,et al. Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.
[46] Vijay S. Pande,et al. Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[47] Anthony A. Maciejewski,et al. An Analysis of Resilience Techniques for Exascale Computing Platforms , 2017, 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).
[48] David R. Kaeli,et al. Field, experimental, and analytical data on large-scale HPC systems and evaluation of the implications for exascale system design , 2015, 2015 IEEE 33rd VLSI Test Symposium (VTS).
[49] Christian Engelmann,et al. Blue Gene/L Log Analysis and Time to Interrupt Estimation , 2009, 2009 International Conference on Availability, Reliability and Security.
[50] Yun Zhou,et al. The Reliability Wall for Exascale Supercomputing , 2012, IEEE Transactions on Computers.
[51] Ricardo Reis,et al. A fast and scalable fault injection framework to evaluate multi/many-core soft error reliability , 2015, 2015 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS).
[52] Anand Sivasubramaniam,et al. BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).
[53] Franck Cappello,et al. Reducing Waste in Extreme Scale Systems through Introspective Analysis , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[54] Eduardo Pinheiro,et al. DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.
[55] Sudhanva Gurumurthi,et al. Feng Shui of supercomputer memory: positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[56] Onur Mutlu,et al. Research Problems and Opportunities in Memory Systems , 2014, Supercomput. Front. Innov..
[57] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[58] Yufei Lin,et al. HiAL-Ckpt: A hierarchical application-level checkpointing for CPU-GPU hybrid systems , 2010, 2010 5th International Conference on Computer Science & Education.
[59] Stephen L. Scott,et al. Reliability-aware resource allocation in HPC systems , 2007, 2007 IEEE International Conference on Cluster Computing.
[60] Jie Cheng,et al. Programming Massively Parallel Processors. A Hands-on Approach , 2010, Scalable Comput. Pract. Exp..
[61] Domenico Cotroneo,et al. Assessing time coalescence techniques for the analysis of supercomputer logs , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).
[62] Nicholas P. Cardo,et al. Detecting and Managing GPU Failures , 2015 .
[63] Olaf Spinczyk,et al. FAIL*: An Open and Versatile Fault-Injection Framework for the Assessment of Software-Implemented Hardware Fault Tolerance , 2015, 2015 11th European Dependable Computing Conference (EDCC).
[64] Jon Stearley,et al. Bad Words: Finding Faults in Spirit's Syslogs , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[65] M. Anusha,et al. Big Data-Survey , 2016 .
[66] Franck Cappello,et al. Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[67] Mohamed Zahran,et al. Heterogeneous Computing: Here to Stay , 2016, ACM Queue.
[68] Laura Monroe,et al. GPU Behavior on a Large HPC Cluster , 2013, Euro-Par Workshops.
[69] Zhiling Lan,et al. A practical failure prediction with location and lead time for Blue Gene/P , 2010, 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W).
[70] Stephen L. Scott,et al. An optimal checkpoint/restart model for a large scale high performance computing system , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[71] Bo Fang,et al. Abstract: Evaluating Error Resiliency of GPGPU Applications , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[72] Alexander Aiken,et al. Alert Detection in System Logs , 2008, 2008 Eighth IEEE International Conference on Data Mining.
[73] Dimitris Gizopoulos,et al. GUFI: A framework for GPUs reliability assessment , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[74] Todd M. Austin,et al. A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.
[75] Bin Nie,et al. Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities , 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).
[76] Luigi Carro,et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[77] Taniya Siddiqua,et al. Analysis and Modeling of Memory Errors from Large-scale Field Data Collection , 2013 .
[78] Luigi Carro,et al. Neutron radiation test of graphic processing units , 2012, 2012 IEEE 18th International On-Line Testing Symposium (IOLTS).
[79] Bianca Schroeder,et al. Cosmic rays don't strike twice: understanding the nature of DRAM errors and the implications for system design , 2012, ASPLOS XVII.
[80] Osman S. Unsal,et al. Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[81] Sarita V. Adve,et al. HeteroSync: A benchmark suite for fine-grained synchronization on tightly coupled GPUs , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).
[82] Dimitris Gizopoulos,et al. Performance-aware reliability assessment of heterogeneous chips , 2017, 2017 IEEE 35th VLSI Test Symposium (VTS).
[83] Hiroaki Kobayashi,et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications , 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[84] Ting Li,et al. Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems , 2013, ParCo 2013.
[85] David Defour,et al. GPUburn: A system to test and mitigate GPU hardware failures , 2013, 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS).
[86] Christopher D. Carothers,et al. An analysis of clustered failures on large supercomputing systems , 2009, J. Parallel Distributed Comput..
[87] Satoshi Matsuoka,et al. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[88] Bo Fang,et al. A Systematic Methodology for Evaluating the Error Resilience of GPGPU Applications , 2016, IEEE Transactions on Parallel and Distributed Systems.
[89] Domenico Cotroneo,et al. Improving Log-based Field Failure Data Analysis of multi-node computing systems , 2011, 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN).
[90] L. Carro,et al. An Efficient and Experimentally Tuned Software-Based Hardening Strategy for Matrix Multiplication on GPUs , 2013, IEEE Transactions on Nuclear Science.
[91] S. Scott,et al. Reliability Analysis in HPC clusters , 2006 .
[92] Bin Nie,et al. A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[93] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[94] Nathan DeBardeleben,et al. Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo , 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[95] Qiang Guan,et al. Improving DRAM Fault Characterization through Machine Learning , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W).
[96] Bianca Schroeder,et al. Understanding failures in petascale computers , 2007 .
[97] Franck Cappello,et al. Fault prediction under the microscope: A closer look into HPC systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[98] Joel Emer,et al. SASSIFI: Evaluating Resilience of GPU Applications , 2015 .
[99] Thanadech Thanakornworakij,et al. The Effect of Correlated Failure on the Reliability of HPC Systems , 2011, 2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications Workshops.
[100] Luigi Carro,et al. Impact of GPUs Parallelism Management on Safety-Critical and HPC Applications Reliability , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[101] Vilas Sridharan,et al. A study of DRAM failures in the field , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[102] Jeffrey S. Vetter,et al. A Survey of Techniques for Modeling and Improving Reliability of Computing Systems , 2016, IEEE Transactions on Parallel and Distributed Systems.
[103] Huiyang Zhou,et al. Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.
[104] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[105] Ling Huang,et al. Online System Problem Detection by Mining Patterns of Console Logs , 2009, 2009 Ninth IEEE International Conference on Data Mining.
[106] Al Geist,et al. A survey of high-performance computing scaling challenges , 2017, Int. J. High Perform. Comput. Appl..
[107] John Shalf,et al. Memory Errors in Modern Systems: The Good, The Bad, and The Ugly , 2015, ASPLOS.
[108] Aparna Chandramowlishwaran,et al. cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs , 2017, 2017 IEEE International Conference on Cluster Computing (CLUSTER).
[109] Jie Wu,et al. Sustainable GPU Computing at Scale , 2011, 2011 14th IEEE International Conference on Computational Science and Engineering.
[110] Bin Nie,et al. Machine Learning Models for GPU Error Prediction in a Large Scale HPC System , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[111] Franck Cappello,et al. Toward Exascale Resilience: 2014 update , 2014, Supercomput. Front. Innov..
[112] Jon Stearley,et al. A State-Machine Approach to Disambiguating Supercomputer Event Logs , 2012, MAD.
[113] Stefano Di Carlo,et al. SIFI: AMD southern islands GPU microarchitectural level fault injector , 2017, 2017 IEEE 23rd International Symposium on On-Line Testing and Robust System Design (IOLTS).
[114] Qiang Wu,et al. Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[115] Elmira Yu. Kalimulina,et al. Analysis of System Reliability with Control, Dependent Failures, and Arbitrary Repair Times , 2015 .
[116] Anand Sivasubramaniam,et al. Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.
[117] Thanadech Thanakornworakij,et al. Reliability model of a system of k nodes with simultaneous failures for high-performance computing applications , 2013, Int. J. High Perform. Comput. Appl..
[118] Jeffrey S. Vetter,et al. A Survey of CPU-GPU Heterogeneous Computing Techniques , 2015, ACM Comput. Surv..
[119] Nathan DeBardeleben,et al. An investigation of the effects of hard and soft errors on graphics processing unit‐accelerated molecular dynamics simulations , 2014, Concurr. Comput. Pract. Exp..
[120] Kevin Skadron,et al. A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors , 2007, GH '07.
[121] Satoshi Matsuoka,et al. Software-Based ECC for GPUs , 2011 .