Reliable Many-Core System-on-Chip Design Using K-Node Fault Tolerant Graphs

State-of-the-art techniques for enhancing the system-level reliability of SoCs include both design-time and run-time strategies, such as task mapping and reliable communication network design. In contrast to task mapping, where the network topology is predefined, fault tolerance in communication network design involves evaluating the reliability of the network topology itself. In this paper, we apply the idea of a k-node fault tolerant graph to address the challenge of reliable network design. Determining a k-node fault tolerant graph for an arbitrary subject graph is non-trivial. We propose a heuristic based on a divide-and-conquer approach and validate the quality of its results against an exhaustive search for small graphs. The effectiveness of the proposed methodology is demonstrated on a real multiprocessor computational task using a commercial system-level design environment.
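To make the central notion concrete, the following minimal sketch checks by brute force whether a candidate graph is k-node fault tolerant with respect to a subject graph. It assumes the usual definition: after removing any k nodes from the candidate, the remainder still contains the subject graph as a subgraph. This is only an illustration of the kind of exhaustive check that is feasible for small graphs, not the heuristic proposed in the paper; the function names and the example graphs are hypothetical.

```python
from itertools import combinations, permutations


def contains_subgraph(host_nodes, host_edges, g_nodes, g_edges):
    """Return True if the subject graph embeds in the host graph.

    Tries every injective mapping of subject nodes onto host nodes
    (exponential cost, so only suitable for very small graphs).
    """
    host = {frozenset(e) for e in host_edges}
    for image in permutations(host_nodes, len(g_nodes)):
        mapping = dict(zip(g_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in host for u, v in g_edges):
            return True
    return False


def is_k_node_fault_tolerant(ft_nodes, ft_edges, g_nodes, g_edges, k):
    """Assumed definition: every choice of k faulty nodes must leave a
    graph that still contains the subject graph as a subgraph."""
    for faulty in combinations(ft_nodes, k):
        alive = [n for n in ft_nodes if n not in faulty]
        alive_edges = [(u, v) for u, v in ft_edges
                       if u not in faulty and v not in faulty]
        if not contains_subgraph(alive, alive_edges, g_nodes, g_edges):
            return False
    return True


# Hypothetical example: the subject graph is a 3-node path a-b-c.
path3_nodes = ["a", "b", "c"]
path3_edges = [("a", "b"), ("b", "c")]

# Candidate 1: a 4-node ring. Removing any one node leaves a 3-node path.
ring4_nodes = [0, 1, 2, 3]
ring4_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]

# Candidate 2: a 4-node path. Removing an inner node disconnects it.
path4_nodes = [0, 1, 2, 3]
path4_edges = [(0, 1), (1, 2), (2, 3)]

print(is_k_node_fault_tolerant(ring4_nodes, ring4_edges,
                               path3_nodes, path3_edges, k=1))
print(is_k_node_fault_tolerant(path4_nodes, path4_edges,
                               path3_nodes, path3_edges, k=1))
```

In this example the ring tolerates any single node fault while the path does not, at the cost of one spare node and one extra link. The check enumerates every fault set and every node mapping, which is why exhaustive search is practical only as a reference for small graphs.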
