Guiding fault-driven adaption in multicore systems through a reliability-aware static task schedule

Future multicore systems will suffer from high and varying fault rates due to device scaling, an increasing number of processing nodes, varying environmental conditions, and aging effects. Efficient fault-tolerant solutions that combine the advantages of static optimization and runtime adaptation are needed. To this end, we propose a static reliability-aware scheduling technique that guides runtime adaptation and relieves it of most of its computational overhead. The proposed static scheduler treats “reliability level” (RL) as an intermediate scheduling dimension and creates a “task-to-RL-to-core” mapping. The “RL-to-core” mapping can then be adapted efficiently at runtime as fault rates vary, while the “task-to-RL” mapping is reused unchanged. Experimental studies show that considering fault rates during static scheduling improves runtime application execution time by up to 19% in a non-constant fault rate environment.
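
The two-level mapping described above can be illustrated with a minimal sketch. It is not the paper's scheduler: the names `map_rl_to_core` and `resolve`, the example fault rates, and the rule of ranking cores by current fault rate (one core per RL, with RL 0 the most reliable) are all assumptions made for illustration. The sketch only shows why a fault-rate change requires recomputing the small “RL-to-core” table while the static “task-to-RL” table stays untouched.

```python
from typing import Dict


def map_rl_to_core(fault_rates: Dict[str, float]) -> Dict[int, str]:
    """Hypothetical runtime step: rank cores by current fault rate.

    Assumption: one core per reliability level, RL 0 = lowest fault rate.
    """
    ranked = sorted(fault_rates, key=lambda core: fault_rates[core])
    return {rl: core for rl, core in enumerate(ranked)}


def resolve(task_to_rl: Dict[str, int], rl_to_core: Dict[int, str]) -> Dict[str, str]:
    """Compose the static task-to-RL table with the current RL-to-core table."""
    return {task: rl_to_core[rl] for task, rl in task_to_rl.items()}


# Static part, computed once offline (values are made up for illustration).
task_to_rl = {"t0": 0, "t1": 1, "t2": 0, "t3": 2}

# Runtime part: fault rates observed before and after a change on core c0.
rates_before = {"c0": 1e-6, "c1": 5e-6, "c2": 2e-5}
rates_after = {"c0": 3e-5, "c1": 5e-6, "c2": 2e-5}

placement_before = resolve(task_to_rl, map_rl_to_core(rates_before))
placement_after = resolve(task_to_rl, map_rl_to_core(rates_after))

print(placement_before)
print(placement_after)
```

In this sketch, only `map_rl_to_core` runs again when fault rates change; `task_to_rl` is never recomputed, which mirrors the reuse of the static mapping claimed in the abstract.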
