Communication and migration energy aware design space exploration for multicore systems with intermittent faults

Shrinking transistor geometries, aggressive voltage scaling and higher operating frequencies have negatively impacted the dependability of embedded multicore systems. Most existing research works on fault-tolerance have focused on transient and permanent faults of cores. Intermittent faults are a separate class of defects resulting from on-chip temperature, pressure and voltage variations and lasting for a few cycles to several seconds or more. Operations of cores impacted by intermittent faults are suspended during these cycles but come back alive when conditions become favorable. This paper proposes a technique to model the availability of multiprocessor systems-on-chip (MPSoCs) with intermittent and reparable device defects. This model is based on Markov chain with stochastic fault distribution and can be applied even for permanent faults. Based on this model, a design space pruning technique is proposed to select a set of task mappings (with variable resource usage), which minimizes the task communication energy while satisfying the MPSoC availability constraint. Moreover, task migration overhead is also minimized, which is an important consideration for frequently occurring intermittent and temperature related faults, where prolonged system downtime during task re-mapping is not desired. Experiments conducted with real-life and synthetic application task graphs demonstrate that the proposed technique minimizes communication energy by 30% and reduces migration overhead by 50% as compared to the existing approaches.

[1]  Tajana Simunic,et al.  Temperature Aware Task Scheduling in MPSoCs , 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[2]  Akash Kumar,et al.  Fault-aware task re-mapping for throughput constrained multimedia applications on NoC-based MPSoCs , 2012, 2012 23rd IEEE International Symposium on Rapid System Prototyping (RSP).

[3]  Lothar Thiele,et al.  Thermal-aware system analysis and software synthesis for embedded multi-processors , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[4]  John R. Douceur,et al.  Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs , 2011, EuroSys '11.

[5]  Radu Marculescu,et al.  Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[6]  Dakai Zhu,et al.  Generalized reliability-oriented energy management for real-time embedded applications , 2011, 2011 48th ACM/EDAC/IEEE Design Automation Conference (DAC).

[7]  Xiaobo Sharon Hu,et al.  Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[8]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[9]  Yu Hu,et al.  IVF: Characterizing the vulnerability of microprocessor structures to intermittent faults , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[10]  Radu Marculescu,et al.  FARM: Fault-aware resource management in NoC-based multiprocessor platforms , 2011, 2011 Design, Automation & Test in Europe.

[11]  Qiang Xu,et al.  Lifetime reliability-aware task allocation and scheduling for MPSoC platforms , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[12]  Bharadwaj Veeravalli,et al.  Reliability-driven task mapping for lifetime extension of networks-on-chip based multiprocessor systems , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[13]  Donald E. Thomas,et al.  A case for lifetime-aware task mapping in embedded chip multiprocessors , 2010, 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[14]  Alois Knoll,et al.  Analysis and optimization of fault-tolerant task scheduling on multiprocessor embedded systems , 2011, 2011 Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS).

[15]  Koushik Chakraborty,et al.  Adapting to Intermittent Faults in Future Multicore Systems , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[16]  Jian-Jia Chen,et al.  Thermal-aware lifetime reliability in multicore systems , 2010, 2010 11th International Symposium on Quality Electronic Design (ISQED).

[17]  Amit Kumar Singh,et al.  A hybrid strategy for mapping multiple throughput-constrained applications on MPSoCs , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).