Trade-offs in execution signature compression for reliable processor systems

As semiconductor processes scale and transistors become more vulnerable to transient upsets, a wide variety of microarchitectural and system-level strategies are emerging to perform efficient error detection and correction in computer systems. While these approaches target various application domains and address error detection and correction at different granularities and with different overheads, an emerging trend is the use of state compression, e.g., cyclic redundancy check (CRC), to reduce the cost of redundancy checking. Prior work has shown that Fletcher's checksum (FC), while less effective than CRC in terms of error detection probability, is less computationally complex when implemented in software. In this paper, we reexamine the suitability of CRC and FC as compression algorithms when implemented in hardware for embedded safety-critical systems. We have developed and evaluated parameterizable implementations of CRC and FC on FPGA, and we observe that what was true for software implementations does not hold in hardware: CRC is more efficient than FC across a wide variety of target input bandwidths and compression strengths.
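To make the two compression functions concrete, the following is a minimal software sketch of each. It is illustrative only and is not the parameterizable FPGA implementation evaluated in the paper. The specific parameters (Fletcher-16 with modulus 255, and a 16-bit CRC with polynomial 0x1021 and initial value 0xFFFF) and the function names `fletcher16` and `crc16_ccitt` are assumptions chosen for the example.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher-16: two running sums modulo 255 (assumed variant)."""
    sum1 = 0
    sum2 = 0
    for byte in data:
        sum1 = (sum1 + byte) % 255   # simple running sum
        sum2 = (sum2 + sum1) % 255   # sum of sums, adds position sensitivity
    return (sum2 << 8) | sum1


def crc16_ccitt(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    """Bit-serial 16-bit CRC, MSB first, no reflection, no final XOR.

    The polynomial and initial value are assumptions for illustration.
    """
    crc = init
    for byte in data:
        crc ^= byte << 8             # bring the next byte into the top bits
        for _ in range(8):
            if crc & 0x8000:         # top bit set: shift and subtract poly
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    print(hex(fletcher16(b"abcde")))       # 0xc8f0
    print(hex(crc16_ccitt(b"123456789")))  # 0x29b1
    # Fletcher's modulo-255 arithmetic cannot tell 0x00 from 0xFF;
    # the CRC distinguishes them.
    print(fletcher16(b"\x00") == fletcher16(b"\xff"))    # True
    print(crc16_ccitt(b"\x00") == crc16_ccitt(b"\xff"))  # False
```

In software, the Fletcher loop needs only two additions per byte, whereas the bit-serial CRC loop iterates over every bit, which is consistent with the software cost difference reported in prior work. The last two lines show one well-known blind spot of the modulo-255 Fletcher sum, an example of its weaker error detection.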
