Reliability Optimization on Multi-Core Systems with Multi-Tasking and Redundant Multi-Threading

Redundant Multithreading (RMT) for error detection and recovery is a prominent technique for mitigating soft-error effects in multi-core systems. RMT can be implemented either as Simultaneous Redundant Threading (SRT), which runs the redundant threads on the same core, or as Chip-level Redundant Multithreading (CRT), which runs them on different cores. However, only a few previously proposed approaches manage CRT adaptively on the system level, and none of them considers both SRT and CRT on the task level. In this paper, we propose a combination of SRT and CRT, called Mixed Redundant Threading (MRT), as an additional option on the task level. Our coarse-grained approach considers SRT, CRT, and MRT simultaneously on the system level, whereas existing results apply either SRT or CRT on the system level, but never both together. In addition, we consider fine-grained task-level optimizations to further improve the system reliability under hard real-time constraints. To optimize the system reliability, we develop several dynamic programming approaches that select the redundancy levels under Federated Scheduling. Simulation results show that our approaches can significantly improve the system reliability compared to state-of-the-art techniques.
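
To make the idea of selecting redundancy levels by dynamic programming concrete, the following is a minimal sketch and not the algorithm developed in the paper. It assumes that each task offers a few redundancy options, each with a hypothetical core demand and a hypothetical reliability value, that every listed option already meets the task's deadline on its dedicated cores (as federated scheduling would require), and that system reliability is the product of the per-task reliabilities. The function name `select_redundancy` and all numbers are illustrative assumptions.

```python
import math


def select_redundancy(tasks, num_cores):
    """Pick one redundancy option per task to maximize overall reliability.

    tasks: list of dicts mapping an option name to (cores, reliability).
           All values are hypothetical inputs, not taken from the paper.
    num_cores: total number of cores available.
    Returns (list of chosen option names, overall reliability),
    or None if no assignment fits within num_cores.
    """
    neg_inf = float("-inf")
    # best[c]: maximum log-reliability of the tasks processed so far
    # when at most c cores are used. With zero tasks this is log(1) = 0.
    best = [0.0] * (num_cores + 1)
    picks = []  # picks[i][c]: option chosen for task i at budget c

    for options in tasks:
        new_best = [neg_inf] * (num_cores + 1)
        pick = [None] * (num_cores + 1)
        for c in range(num_cores + 1):
            for name, (cores, rel) in options.items():
                if cores <= c and best[c - cores] > neg_inf:
                    value = best[c - cores] + math.log(rel)
                    if value > new_best[c]:
                        new_best[c] = value
                        pick[c] = name
        best = new_best
        picks.append(pick)

    if best[num_cores] == neg_inf:
        return None  # not enough cores for even the cheapest options

    # Backtrack from the full budget to recover the chosen options.
    chosen = []
    c = num_cores
    for options, pick in zip(reversed(tasks), reversed(picks)):
        name = pick[c]
        chosen.append(name)
        c -= options[name][0]
    chosen.reverse()
    return chosen, math.exp(best[num_cores])


# Hypothetical example: two tasks, each either unprotected on one core
# or protected by CRT on two cores.
example_tasks = [
    {"none": (1, 0.90), "CRT": (2, 0.99)},
    {"none": (1, 0.80), "CRT": (2, 0.98)},
]
result = select_redundancy(example_tasks, 3)
```

With three cores in this made-up instance, only one task can be given CRT, and the sketch assigns it to the second task because that yields the larger product of reliabilities. Summing log-reliabilities instead of multiplying keeps the recurrence additive, which is the usual form for a multiple-choice knapsack.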