OCEAN

Recent advances in process technology trigger reliability issues that degrade the Quality-of-Service (QoS) required by embedded Systems-on-Chip (SoCs). To maintain the required QoS with acceptable overheads, we propose OCEAN, a novel cross-layer error mitigation mechanism. OCEAN enforces the reliability of on-chip SRAMs with a fault-tolerant buffer. We use this buffer to protect a portion of the processed data, which is then used to recover from runtime errors. We optimally select the buffer size to minimize the energy overhead, subject to timing and area constraints. OCEAN achieves full error mitigation with 10.1% average energy overhead compared to baseline operation that does not include any error correction capability, and 65% energy savings compared to a cross-layer error mitigation mechanism.
