A Scalable and Reconfigurable Fault-Tolerant Distributed Routing Algorithm for NoCs

Manufacturing defects in deep sub-micron VLSI processes and aging-induced device failures over a chip's lifetime are inevitable, so fault-tolerant routing algorithms are needed to keep networks-on-chip (NoCs) communicating despite failures. The proposed algorithm, referred to as scalable and reconfigurable fault-tolerant distributed routing (RFDR), partitions the system into nine regions using a divide-and-conquer approach. It is a distributed algorithm: each router guarantees fault tolerance within its own region, and the system remains operational in the presence of multiple fault areas. RFDR scales well because its hardware cost stays constant regardless of system size, and it is fully reconfigurable when new nodes fail. Simulations under various synthetic traffic patterns show that it outperforms the Extended-XY routing algorithm. Moreover, it adds almost no hardware overhead relative to Logic-Based Distributed Routing (LBDR) while offering greater fault-tolerance capability. Its hardware cost is 37% lower than that of the Reconfigurable Distributed Scalable Predictable Interconnect Network (R-DSPIN), which supports only a single fault region.
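
The abstract does not say how the nine regions are defined. As an illustration only, the sketch below assumes one plausible reading: each router classifies a destination by its position relative to itself, giving the local node, four axis-aligned regions, and four quadrants. The function name `classify_region`, the `(x, y)` coordinate convention with y growing southward, and the region labels are all hypothetical and are not taken from the paper.

```python
def classify_region(cur, dst):
    """Classify dst relative to cur into one of nine regions.

    ASSUMPTION: regions are defined by relative position in a 2D mesh,
    with coordinates given as (x, y) and y increasing toward the south.
    This is an illustrative reading, not the paper's definition.
    """
    dx = dst[0] - cur[0]
    dy = dst[1] - cur[1]

    # Horizontal component: west, same column, or east.
    horizontal = "W" if dx < 0 else ("E" if dx > 0 else "")
    # Vertical component: north, same row, or south.
    vertical = "N" if dy < 0 else ("S" if dy > 0 else "")

    # An empty label means the destination is the current router itself.
    return (vertical + horizontal) or "LOCAL"


def regions_seen_from(cur, width, height):
    """Collect the distinct region labels over a width x height mesh."""
    return {
        classify_region(cur, (x, y))
        for x in range(width)
        for y in range(height)
    }
```

Under this assumption the classification needs only two coordinate comparisons per packet, whatever the mesh dimensions, which is in keeping with the claim that hardware cost does not grow with system size. For a router that is not on the mesh boundary, `regions_seen_from` yields all nine labels.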
