Improving Reliability of Multi-/Many-Core Processors by Using NMR-MPar Approach

Computing systems increasingly rely on multi-core and many-core processors. Commercial off-the-shelf (COTS) processors are preferred because they offer high performance and low power consumption at an affordable price. Recently, these devices have been adopted in High Performance Computing systems because of their massive parallelism and low power budget. Over the last decade, industrial and academic partners have worked together to overcome dependability issues in order to extend the use of these devices to embedded systems. Despite multiple proposals for improving multi-core reliability, their use has not yet been validated for critical tasks. This chapter describes a new fault-tolerance approach called NMR-MPar, which is based on N-Modular Redundancy and M-Partitions, to improve the reliability of applications running on these devices. An evaluation of the NMR-MPar approach on two complementary benchmark applications running on the 28 nm CMOS MPPA-256 many-core processor shows that the approach can be considered for mixed-criticality systems. Finally, this chapter analyses the overhead of the approach in terms of power and energy consumption.
