SHiFA: System-level hierarchy in run-time fault-aware management of many-core systems

This paper proposes SHiFA, a system-level approach to fault-aware resource management for many-core systems. SHiFA tolerates run-time faults at the system level without any hardware overhead. In contrast to existing system-level methods, it also treats network resources as potentially faulty. Accordingly, applications are mapped at run-time onto healthy nodes of the system such that their communication does not require the use of faulty elements. With a simple routing approach, results show 100% utilizability of PEs and 99.41% successful mappings when up to 8 links are broken. SHiFA's design is based on distributed operating systems, which keeps it scalable for future many-core systems. A significant improvement in scalability is observed compared to state-of-the-art distributed approaches.
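The mapping constraint described above can be illustrated with a minimal sketch. It is not SHiFA's algorithm: it assumes a 2D mesh with dimension-ordered XY routing (the abstract only says "simple routing"), and the names `xy_route` and `mapping_is_fault_free` are hypothetical. It checks only whether a candidate placement keeps every task on a healthy node and every message off a broken link.

```python
# Illustrative sketch only. Assumes a 2D mesh and XY routing;
# the function names and data layout are hypothetical.

def xy_route(src, dst):
    """Return the undirected links crossed when routing X first, then Y."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(frozenset({(x, y), (nx, y)}))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(frozenset({(x, y), (x, ny)}))
        y = ny
    return links


def mapping_is_fault_free(placement, edges, faulty_nodes, broken_links):
    """Check a candidate task-to-node placement against known faults.

    placement:    dict mapping task name -> (x, y) node
    edges:        iterable of (sender_task, receiver_task) pairs
    faulty_nodes: set of (x, y) nodes that must not host a task
    broken_links: set of frozenset({node_a, node_b}) links
    """
    # No task may sit on a faulty node.
    if any(node in faulty_nodes for node in placement.values()):
        return False
    # No message may cross a broken link on its XY route.
    for sender, receiver in edges:
        route = xy_route(placement[sender], placement[receiver])
        if any(link in broken_links for link in route):
            return False
    return True


broken = {frozenset({(1, 0), (1, 1)})}
edges = [("t0", "t1")]

# t0 -> t1 would travel (0,0) -> (1,0) -> (1,1), crossing the broken link.
bad = mapping_is_fault_free({"t0": (0, 0), "t1": (1, 1)}, edges, set(), broken)

# Placing t1 at (0,1) gives the route (0,0) -> (0,1), which avoids it.
good = mapping_is_fault_free({"t0": (0, 0), "t1": (0, 1)}, edges, set(), broken)
```

In this sketch a run-time mapper would reject the first placement and accept the second, so the broken link is avoided by the choice of nodes alone and the routers need no extra fault-tolerance hardware.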
