Unified system level reliability evaluation methodology for multiprocessor Systems-on-Chip

Reliability is a growing fundamental challenge in the design of multiprocessor Systems-on-Chip (MPSoCs). This trend is accelerated by increasingly adverse process variations and wearout mechanisms, which increase the number of errors. Previously proposed fault-tolerant techniques are ad hoc and target either processors or Networks-on-Chip (NoCs) separately. Because either of these two units may become a reliability bottleneck for NoC-based MPSoCs, the reliability of such SoCs must be evaluated and addressed in a unified manner, treating the system as a combination of communication and computational units. Following this holistic approach, we propose a new architecture-level unified reliability evaluation methodology for MPSoCs. At the core of the reliability estimation engine lies a Monte Carlo algorithm that works with failure times for time-dependent dielectric breakdown (TDDB) and negative bias temperature instability (NBTI), modeled as Weibull distributions. To demonstrate its usefulness, we use the proposed methodology to explore the impact of NoC router layout on the failure time of a system running the same set of benchmarks. In addition, we compare the failure time of the system when the NoC, the communication unit of the MPSoC, is taken into consideration and when it is ignored, as in previous work. Our simulation framework can help architecture designers identify architectural characteristics and develop design techniques that improve the system's lifetime.
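As a rough illustration of the kind of Monte Carlo estimation described above, the following minimal sketch samples Weibull-distributed failure times for TDDB and NBTI per component and averages the system failure time over many trials. It is not the paper's engine. The function names, the Weibull scale and shape values, the 4-core and 4-router configuration, and the failure criterion (the system fails when its first component fails) are all assumptions made for illustration.

```python
import random

def sample_component_failure(mechanisms, rng):
    """Failure time of one component: the earliest of its wearout mechanisms.

    mechanisms is a list of (scale, shape) Weibull parameters, one pair per
    mechanism (here TDDB and NBTI). All values are hypothetical.
    """
    return min(rng.weibullvariate(scale, shape) for scale, shape in mechanisms)

def estimate_mttf(components, trials=5000, seed=0):
    """Monte Carlo estimate of the mean time to failure of the system.

    Assumption: the system fails as soon as any component fails
    (a series system); the paper may use a different criterion.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(sample_component_failure(m, rng) for m in components)
    return total / trials

# Hypothetical (scale in years, shape) parameters for TDDB and NBTI.
CORE = [(30.0, 2.0), (40.0, 2.5)]
ROUTER = [(35.0, 2.0), (45.0, 2.5)]

cores_only = [CORE] * 4                # computation units only
cores_and_noc = [CORE] * 4 + [ROUTER] * 4  # computation plus communication units

mttf_without_noc = estimate_mttf(cores_only)
mttf_with_noc = estimate_mttf(cores_and_noc)
print(f"MTTF ignoring the NoC:  {mttf_without_noc:.2f} years")
print(f"MTTF including the NoC: {mttf_with_noc:.2f} years")
```

Under these assumptions, adding the routers can only shorten the estimated lifetime, which mirrors the abstract's point that ignoring the NoC gives an optimistic picture of system reliability.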
