FARM: Fault-aware resource management in NoC-based multiprocessor platforms

In this paper, we address the problem of run-time resource management in non-ideal multiprocessor platforms where communication happens via the Network-on-Chip (NoC) approach. More precisely, we propose a system-level fault-tolerant technique for application mapping that aims to optimize overall system performance and communication energy consumption while accounting for permanent, transient, and intermittent faults in the system. As the main theoretical contribution, we address the problem of spare core placement and its impact on system fault-tolerance (FT) properties. We then investigate several metrics and provide insight into the fault-aware resource management process for such non-ideal multiprocessor platforms. Experimental results show that the proposed resource management technique is efficient and highly scalable, and that it achieves significant throughput improvements over existing solutions that do not consider failures in the system.