A lightweight and open-source framework for the lifetime estimation of multicore systems

This paper presents a Monte Carlo-based framework for estimating the lifetime reliability of multicore systems. Existing mathematical tools either consider only the time to the first failure or are limited by their intrinsic complexity and high computational time. The proposed framework computes quasi-exact results in a reasonable computational time, without adopting the typical (and possibly misleading) simplifications that characterize existing tools for computing the Mean Time To Failure (MTTF). The paper describes the framework with all its mathematical details, assumptions, and simplifications. It demonstrates the correctness of the obtained results by comparing them against the exact ones, highlights the differences with respect to the simplistic approaches, and discusses the resulting reduction in time overhead.
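
The difference between the time to the first failure and the lifetime of the whole system can be illustrated with a minimal Monte Carlo sketch. It is not the framework described in the paper: the function name `estimate_mttf`, its parameters, the Weibull lifetime model, and the rule that the system survives while at least `min_alive` cores are working are all assumptions made for illustration. The sketch also assumes independent core lifetimes and ignores the faster aging of surviving cores after the workload is redistributed.

```python
import random


def estimate_mttf(n_cores, min_alive, scale, shape, n_trials, seed=0):
    """Monte Carlo estimate of two lifetime metrics (hypothetical helper).

    Returns a pair (mean time to first core failure, mean time to system
    failure), where the system is assumed to fail as soon as fewer than
    `min_alive` cores are still working.
    """
    if not 1 <= min_alive <= n_cores:
        raise ValueError("min_alive must be between 1 and n_cores")

    rng = random.Random(seed)
    # The system fails at the (n_cores - min_alive + 1)-th core failure,
    # which sits at this zero-based index in the sorted failure times.
    fatal_index = n_cores - min_alive

    first_sum = 0.0
    system_sum = 0.0
    for _ in range(n_trials):
        # Assumption: independent Weibull-distributed core lifetimes.
        failures = sorted(
            rng.weibullvariate(scale, shape) for _ in range(n_cores)
        )
        first_sum += failures[0]
        system_sum += failures[fatal_index]

    return first_sum / n_trials, system_sum / n_trials


if __name__ == "__main__":
    first, system = estimate_mttf(
        n_cores=4, min_alive=2, scale=10.0, shape=2.0, n_trials=20000
    )
    print(f"mean time to first failure:  {first:.2f} years")
    print(f"mean time to system failure: {system:.2f} years")
```

With any `min_alive` smaller than `n_cores`, the second estimate is at least as large as the first, which is the gap that a first-failure-only analysis leaves out.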
