Vicis: A reliable network for unreliable silicon

Process scaling has given designers billions of transistors to work with. As feature sizes near the atomic scale, extensive variation and wearout inevitably make margining uneconomical or impossible. The ElastIC project seeks to address this by creating a large-scale chip-multiprocessor that can self-diagnose, adapt, and heal. Creating large, flexible designs in this environment naturally lends itself to the repetitive nature of network-on-chip (NoC), but the loss of a single link or router will result in complete network failure. In this work we present Vicis, an ElastIC-style NoC that can tolerate the loss of many network components due to wearout induced hard faults. Vicis uses the inherent redundancy in the network and its routers in order to maintain correct operation while incurring a much lower area overhead than previously proposed N-modular redundancy (NMR) based solutions. Each router has a built-in-self-test (BIST) that diagnoses the locations of hard fault and runs a number of algorithms to best use ECC, port swapping, and a crossbar bypass bus to mitigate them. The routers work together to run distributed algorithms to solve network-wide problems as well, protecting the networking against critical failures in individual routers. In this work we show that with stuck-at fault rates as high as 1 in 2000 gates, Vicis will continue to operate with approximately half of its routers still functional and communicating.

[1]  S. Borkar,et al.  An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.

[2]  Valentin Puente,et al.  Immunet: a cheap and robust fault-tolerant packet routing mechanism , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[3]  Lionel M. Ni,et al.  Fault-tolerant routing in hypercube multicomputers using local safety information , 1996 .

[4]  Jie Wu,et al.  A Fault-Tolerant and Deadlock-Free Routing Protocol in 2D Meshes Based on Odd-Even Turn Model , 2003, IEEE Trans. Computers.

[5]  David Wentzlaff,et al.  Processor: A 64-Core SoC with Mesh Interconnect , 2010 .

[6]  Kwang-Ting Cheng,et al.  A Framework for System Reliability Analysis Considering Both System Error Tolerance and Component Test Quality , 2007, 2007 Design, Automation & Test in Europe Conference & Exhibition.

[7]  Luca Benini,et al.  Low power error resilient encoding for on-chip data buses , 2002, Proceedings 2002 Design, Automation and Test in Europe Conference and Exhibition.

[8]  Scott A. Mahlke,et al.  BulletProof: a defect-tolerant CMP switch architecture , 2006, The Twelfth International Symposium on High-Performance Computer Architecture, 2006..

[9]  Lionel M. Ni,et al.  Fault-tolerant wormhole routing in meshes without virtual channels , 1996, IEEE Transactions on Parallel and Distributed Systems.

[10]  Chita R. Das,et al.  Exploring Fault-Tolerant Network-on-Chip Architectures , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[11]  Axel Jantsch,et al.  A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip , 2003, First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721).

[12]  Tobias Bjerregaard,et al.  A survey of research and practices of Network-on-chip , 2006, CSUR.

[13]  Antonio Robles,et al.  An Efficient Fault-Tolerant Routing Methodology for Meshes and Tori , 2004, IEEE Computer Architecture Letters.

[14]  Shekhar Y. Borkar,et al.  Microarchitecture and Design Challenges for Gigascale Integration , 2004, MICRO.

[15]  Paulo F. Butzen,et al.  An Array-Based Test Circuit for Fully Automated Gate Dielectric Breakdown Characterization , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[16]  David Blaauw,et al.  ElastIC: An Adaptive Self-Healing Architecture for Unpredictable Silicon , 2006, IEEE Design & Test of Computers.

[17]  José Duato,et al.  Efficient unicast and multicast support for CMPs , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[18]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[19]  Jipeng Zhou,et al.  Multi-phase minimal fault-tolerant wormhole routing in meshes , 2004, Parallel Comput..

[20]  David Blaauw,et al.  Reliability modeling and management in dynamic microprocessor-based systems , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[21]  William J. Dally,et al.  The Reliable Router: A Reliable and High-Performance Communication Substrate for Parallel Computers , 1994, PCRCW.

[22]  David Blaauw,et al.  A highly resilient routing algorithm for fault-tolerant NoCs , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[23]  Larry J. Stockmeyer,et al.  A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers , 2002, IEEE Transactions on Computers.