Accelerated assessment of fine-grain AVF in NoC using a Multi-Cell Upsets considered fault injection

Abstract With the increasing threat of bit upsets induced by soft errors, the Network on Chip (NoC), as the communication infrastructure of many-core systems, has been shown to be a reliability bottleneck in fault-tolerant parallel systems. The commonly used Architecture Vulnerability Factor (AVF) metric measures the architecture-level impact of soft errors, so that the design cost of fault-tolerant schemes can be traded off against reliability. Complementing existing estimation methods for standard IP such as processors and caches, this work proposes an accelerated fault injection methodology for fine-grain AVF assessment in NoC, built on two components: (1) modeling the complex fault patterns of both Multi-Cell Upsets (MCU) and Single Bit Upsets (SBU) within the standard Fault Injection (FI) method; (2) accelerating the estimation by classifying and exploiting fine-grain metrics according to their different error impacts. Comprehensive simulation results over diverse configurations (e.g., varying fault model, benchmark, traffic load, network size and fault list size) demonstrate that the proposed approach (i) shrinks the estimation inaccuracy due to MCU patterns (on average, 18.89% underestimation in the unprotected case and 88.92% overestimation under ECC (Error Correction Coding) protection); (ii) achieves about a 5× speedup without loss of estimation accuracy, via phased pre-analysis based on fine-grain classification; (iii) verifies that ECC is a cost-effective mechanism for protecting the NoC router: soft errors are reduced by about 50% relative to the unprotected case, with less than 2% area overhead.
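
As a rough illustration of fault-injection-based AVF estimation with both SBU and MCU patterns, the following minimal Python sketch injects random upsets into a toy buffer and counts how many lead to an error. It is not the authors' simulator: the function names (`inject`, `estimate_avf`), the parameters (`live_fraction`, `mcu_prob`, `max_cluster`), the assumption that an MCU flips adjacent bits within one word, and the abstract single-error-correcting model of ECC are all hypothetical choices made for this example. The fine-grain classification and the acceleration step described in the abstract are not modeled.

```python
import random


def inject(rng, n_words, word_bits, mcu_prob, max_cluster):
    """Return the set of (word, bit) cells flipped by one simulated strike.

    Assumption: an MCU flips a run of adjacent bits inside a single word.
    A run that reaches the end of the word is clipped, so it may shrink.
    """
    word = rng.randrange(n_words)
    bit = rng.randrange(word_bits)
    size = 1  # SBU by default
    if rng.random() < mcu_prob:
        size = rng.randint(2, max_cluster)  # MCU cluster size (needs max_cluster >= 2)
    return {(word, b) for b in range(bit, min(bit + size, word_bits))}


def estimate_avf(n_trials, live_fraction, mcu_prob, ecc, seed=0,
                 n_words=16, word_bits=32, max_cluster=4):
    """Estimate AVF as the fraction of injected faults that cause an error.

    Toy masking rules (assumptions, not taken from the paper):
      * a fault is masked if the struck word holds no live data;
      * with ecc=True, a strike that flips exactly one bit is corrected.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_trials):
        flipped = inject(rng, n_words, word_bits, mcu_prob, max_cluster)
        live = rng.random() < live_fraction
        if not live:
            continue  # fault hit an idle buffer slot: masked
        if ecc and len(flipped) == 1:
            continue  # single-bit error corrected by the assumed SEC code
        failures += 1
    return failures / n_trials


if __name__ == "__main__":
    for mcu_prob in (0.0, 0.2):
        for ecc in (False, True):
            avf = estimate_avf(20000, live_fraction=0.5, mcu_prob=mcu_prob, ecc=ecc)
            print(f"mcu_prob={mcu_prob} ecc={ecc} AVF~{avf:.3f}")
```

In this toy model, an SBU-only fault list (`mcu_prob=0.0`) reports zero vulnerability under ECC, whereas adding MCU patterns exposes the multi-bit strikes that the assumed code cannot correct. This is one way to see why the choice of fault model can change the estimate.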
