Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems
Minyi Guo | Quan Chen | Pradip Bose | Jingwen Leng | Vijay Janapa Reddi | Alper Buyuktosunoglu | Ramon Bertran Monfort
[1] Bishop Brock,et al. Active management of timing guardband to save energy in POWER7 , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[2] Devesh Tiwari,et al. Compiler-Directed Lightweight Checkpointing for Fine-Grained Guaranteed Soft Error Recovery , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Gu-Yeon Wei,et al. Benchmarking TPU, GPU, and CPU Platforms for Deep Learning , 2019, ArXiv.
[4] Meeta Sharma Gupta,et al. Understanding Soft Error Resiliency of Blue Gene/Q Compute Chip through Hardware Proton Irradiation and Software Fault Injection , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[5] Jingwen Leng,et al. Energy Efficiency Benefits of Reducing the Voltage Guardband on the Kepler GPU Architecture , 2014 .
[6] Hyeran Jeon,et al. Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[7] Amin Ansari,et al. Encore: Low-cost, fine-grained transient fault recovery , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[8] Michael D. Smith,et al. Voltage Smoothing: Characterizing and Mitigating Voltage Noise in Production Processors via Software-Guided Thread Scheduling , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[9] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Pradip Bose,et al. Voltage Noise in Multi-Core Processors: Empirical Characterization and Optimization Opportunities , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[11] Sanjay Pant,et al. A self-tuning DVS processor using delay-error detection and correction , 2005, IEEE Journal of Solid-State Circuits.
[12] David Blaauw,et al. Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation , 2003, MICRO.
[13] David A. Wood,et al. LogCA: A high-level performance model for hardware accelerators , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[14] Frank Mueller,et al. Snapify: capturing snapshots of offload applications on Xeon Phi manycore processors , 2014, HPDC '14.
[15] Kwok Kee Wei,et al. A Survey of SQL Language , 1993 .
[16] David A. Patterson,et al. A new golden age for computer architecture , 2019, Commun. ACM.
[17] Rajesh K. Gupta,et al. Reliability-Aware Data Placement for Heterogeneous Memory Architecture , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[18] Sudhakar Yalamanchili,et al. Ocelot: A dynamic optimization framework for bulk-synchronous applications in heterogeneous systems , 2010, 2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT).
[19] Xin He,et al. Voltage-Stacked GPUs: A Control Theory Driven Cross-Layer Solution for Practical Voltage Stacking in GPUs , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[20] Margaret Martonosi,et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[21] Pradip Bose,et al. BRAVO: Balanced Reliability-Aware Voltage Optimization , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[22] Sudhanva Gurumurthi,et al. Feng Shui of supercomputer memory: positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[23] Radu Teodorescu,et al. Dynamic reduction of voltage margins by leveraging on-chip ECC in Itanium II processors , 2013, ISCA.
[24] Pavan Balaji,et al. VOCL-FT: introducing techniques for efficient soft error coprocessor recovery , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[25] Jacob A. Abraham,et al. CEDA: control-flow error detection through assertions , 2006, 12th IEEE International On-Line Testing Symposium (IOLTS'06).
[26] Dionisios N. Pnevmatikatos,et al. The DeSyRe Runtime Support for Fault-Tolerant Embedded MPSoCs , 2014, 2014 IEEE International Symposium on Parallel and Distributed Processing with Applications.
[27] Alfred V. Aho,et al. Compilers: Principles, Techniques, and Tools (2nd Edition) , 2006 .
[28] Luigi Carro,et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[29] Jinsuk Chung,et al. Containment domains: A scalable, efficient, and flexible resilience scheme for exascale systems , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[30] Michael Sullivan,et al. CRUM: Checkpoint-Restart Support for CUDA's Unified Memory , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[31] William J. Dally,et al. Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly , 2018, USENIX Annual Technical Conference.
[32] Ravishankar K. Iyer,et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[33] Li Zhou,et al. Core tunneling: Variation-aware voltage noise mitigation in GPUs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[34] Pradip Bose,et al. Safe limits on voltage reduction efficiency in GPUs: A direct measurement approach , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[35] Paolo A. Aseron,et al. A 45 nm Resilient Microprocessor Core for Dynamic Variation Tolerance , 2011, IEEE Journal of Solid-State Circuits.
[36] Tipp Moseley,et al. Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[37] Abhinav Vishnu,et al. Codesign Challenges for Exascale Systems: Performance, Power, and Reliability , 2011, Computer.
[38] Karthikeyan Sankaralingam,et al. Idempotent processor architecture , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[39] Jingwen Leng,et al. GPU voltage noise: Characterization and hierarchical smoothing of spatial and temporal voltage noise interference in GPU architectures , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[40] Albert Meixner,et al. Error Detection via Online Checking of Cache Coherence with Token Coherence Signatures , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.
[41] Christoforos E. Kozyrakis,et al. Convolution engine , 2015, Commun. ACM.
[42] David A. Wood,et al. Border control: Sandboxing accelerators , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[43] Christof Fetzer,et al. ELZAR: Triple Modular Redundancy Using Intel AVX (Practical Experience Report) , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN).
[44] Bin Nie,et al. A large-scale study of soft-errors on GPUs in the field , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[45] Albert Meixner,et al. Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[46] Devesh Tiwari,et al. Clover: Compiler Directed Lightweight Soft Error Resilience , 2015, LCTES.
[47] Ninghui Sun,et al. DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning , 2014, ASPLOS.
[48] Radu Teodorescu,et al. EmerGPU: Understanding and mitigating resonance-induced voltage noise in GPU architectures , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[49] George Konidaris,et al. The microarchitecture of a real-time robot motion planning accelerator , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[50] Hiroaki Kobayashi,et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications , 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[51] Sanjay J. Patel,et al. ReStore: Symptom-Based Soft Error Detection in Microprocessors , 2005, IEEE Transactions on Dependable and Secure Computing.
[52] Hiroaki Kobayashi,et al. CheCL: Transparent Checkpointing and Process Migration of OpenCL Applications , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[53] Tianshi Chen,et al. ShiDianNao: Shifting vision processing closer to the sensor , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[54] Asim Kadav,et al. Tolerating hardware device failures in software , 2009, SOSP '09.
[55] Asim Kadav,et al. Fine-grained fault tolerance using device checkpoints , 2013, ASPLOS '13.
[56] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[57] Xiaowei Li,et al. FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[58] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[59] Lizy Kurian John,et al. AUDIT: Stress Testing the Automatic Way , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[60] Rajeev Balasubramonian,et al. Power Efficient Approaches to Redundant Multithreading , 2007, IEEE Transactions on Parallel and Distributed Systems.
[61] Shidhartha Das,et al. Harnessing Voltage Margins for Energy Efficiency in Multicore CPUs , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[62] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[63] Karthikeyan Sankaralingam,et al. iGPU: Exception support and speculative execution on GPUs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[64] David A. Wood,et al. gem5-gpu: A Heterogeneous CPU-GPU Simulator , 2015, IEEE Computer Architecture Letters.
[65] Xiang Pan,et al. VRSync: Characterizing and eliminating synchronization-induced voltage emergencies in many-core processors , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[66] Satoshi Matsuoka,et al. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.
[67] Stijn Eyerman,et al. Reliability-Aware Scheduling on Heterogeneous Multicore Processors , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[68] Michel Dubois,et al. Core Reliability: Leveraging Hardware Transactional Memory , 2018, IEEE Computer Architecture Letters.
[69] Osman S. Unsal,et al. Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[70] Oreste Villa,et al. NVBit: A Dynamic Binary Instrumentation Framework for NVIDIA GPUs , 2019, MICRO.
[71] Ada Gavrilovska,et al. HeteroCheckpoint: Efficient Checkpointing for Accelerator-Based Systems , 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[72] Kenneth A. Ross,et al. Q100: the architecture and design of a database processing unit , 2014, ASPLOS.
[73] Subhasish Mitra,et al. ERSA: Error Resilient System Architecture for probabilistic applications , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).
[74] David I. August,et al. Automatic Instruction-Level Software-Only Recovery , 2006, IEEE Micro.
[75] Ismail Akturk,et al. Trading Computation for Communication: A Taxonomy of Data Recomputation Techniques , 2018 .