Fault-Tolerant Deployment of Dataflow Applications Using Virtual Processors

Multiprocessors are well suited to hosting a dynamic mix of real-time dataflow applications, but they are increasingly subject to faults as feature sizes decrease. Applications can start and stop as needed if each executes on a private set of Virtual Processors (VPs) that are deployed on the physical processors. This allows online software updates, but makes it impossible to predict the deployment in advance. If a fault renders a processor unusable, the free resources on the other processors may be too fragmented to allow its VPs to be re-deployed. We show that mapping an application to more VPs reduces the maximum VP size. This increases the probability of successfully dealing with faults, at the cost of an increase in the total size. Such a mapping can either be used from the start, or the VPs can be split only when a fault occurs. Experiments confirm the feasibility of our approach and show a trade-off between improved fault tolerance and resource usage for both strategies.
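
The fragmentation argument can be illustrated with a small sketch. It is not the paper's algorithm: the first-fit-decreasing placement, the fixed per-VP overhead, and the names redeploy and split_vp are all assumptions made for illustration. VP sizes and free capacities are expressed as integer shares of a processor (for example, percent).

```python
from typing import List, Optional


def redeploy(vp_sizes: List[int], free: List[int]) -> Optional[List[int]]:
    """Place each VP on a surviving processor using first-fit decreasing.

    Returns the processor index chosen for each VP, or None if some VP
    does not fit in any single processor's free capacity.
    (Assumed heuristic, chosen only for illustration.)
    """
    remaining = list(free)
    placement = [-1] * len(vp_sizes)
    # Consider the largest VPs first; ties keep their original order.
    for i in sorted(range(len(vp_sizes)), key=lambda i: -vp_sizes[i]):
        for p, cap in enumerate(remaining):
            if vp_sizes[i] <= cap:
                remaining[p] -= vp_sizes[i]
                placement[i] = p
                break
        else:
            return None  # free resources are too fragmented for this VP
    return placement


def split_vp(size: int, parts: int, overhead: int) -> List[int]:
    """Split one VP into `parts` smaller VPs.

    Each part carries a hypothetical fixed overhead, so the total size
    grows while the maximum VP size shrinks.
    """
    base, extra = divmod(size, parts)
    return [base + (1 if k < extra else 0) + overhead for k in range(parts)]


# A failed processor hosted one VP of size 40; the other processors
# have 25, 25 and 30 free. No single processor can take the whole VP.
whole = redeploy([40], [25, 25, 30])

# Splitting into two VPs (assumed overhead 3 each) raises the total
# from 40 to 46 but lowers the maximum VP size to 23, which fits.
halves = split_vp(40, 2, 3)
after_split = redeploy(halves, [25, 25, 30])
```

In this toy case the total free capacity (80) always exceeds the demand, yet the unsplit VP cannot be placed. Splitting trades a larger total size for a smaller maximum size, which is the trade-off the abstract describes.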
