Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems

Accelerators make it hard to build systems that are resilient against transient errors such as voltage noise and soft errors. Architects integrate accelerators into the system as black-box third-party IP components, so a fault in one or more accelerators may threaten the system's reliability if there are no established failure semantics for how an error propagates from the accelerator to the main CPU. Existing solutions that assure system reliability sacrifice accelerator generality and efficiency, and incur significant overhead even in the absence of errors. To overcome these drawbacks, we examine reliability management of accelerator systems via hardware-software co-design, coupling an efficient architecture design with compiler and run-time support to cope with transient errors. We introduce asymmetric resilience, which architects reliability at the system level, centered around a hardened CPU, rather than at the accelerator level. At run time, the system exploits task-level idempotency to contain accelerator errors and uses memory protection instead of taking checkpoints to mitigate overheads. We also leverage the fact that errors rarely occur in systems, and exploit the trade-off between error recovery performance and improved error-free performance to enhance system efficiency. Using GPUs, which are at the forefront of accelerator systems, we demonstrate how our system architecture manages reliability in both integrated and discrete systems, under voltage-noise and soft-error related faults, leading to extremely low overhead (less than 1%) and substantial gains (20% energy savings on average).
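
The recovery idea described above (contain an accelerator error inside an idempotent task, keep the task's inputs protected, and re-execute rather than restore a checkpoint) can be illustrated with a minimal Python sketch. This is not the paper's implementation: `TransientError`, `run_idempotent`, and `make_flaky_scale` are hypothetical names, the read-only mapping merely stands in for hardware memory protection, and the injected exception stands in for whatever error detection the real system uses.

```python
from types import MappingProxyType


class TransientError(Exception):
    """Hypothetical stand-in for a detected accelerator fault."""


def run_idempotent(task, inputs, max_retries=3):
    """Run `task` on protected inputs and re-execute it if a fault is detected.

    Because the task cannot modify its inputs, re-running it from the start
    is safe, so no checkpoint of the task's state is needed.
    Returns (result, number_of_attempts).
    """
    # Read-only view: an assumption standing in for memory protection.
    protected = MappingProxyType(inputs)
    for attempt in range(1, max_retries + 1):
        try:
            return task(protected), attempt
        except TransientError:
            # Inputs are untouched; discard partial work and retry.
            continue
    raise RuntimeError("task did not complete within the retry budget")


def make_flaky_scale(failures):
    """Build a task that raises a transient error `failures` times, then succeeds."""
    remaining = [failures]

    def scale(inp):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientError("injected fault")
        return [2 * x for x in inp["data"]]

    return scale


result, attempts = run_idempotent(make_flaky_scale(2), {"data": (1, 2, 3)})
print(result, attempts)
```

In the error-free case the only cost in this sketch is wrapping the inputs, while a fault costs a full re-execution of the task. That mirrors, in a very simplified way, the trade-off the abstract describes between error-recovery performance and error-free performance.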
