Architecting a reliable CMP switch architecture

As silicon technologies move into the nanometer regime, transistor reliability is expected to wane as devices become subject to extreme process variation, particle-induced transient errors, and transistor wear-out. Unless these challenges are addressed, computer vendors can expect low yields and short mean-times-to-failure. In this article, we examine the challenges of designing complex computing systems in the presence of transient and permanent faults. We select one small component of a typical chip multiprocessor (CMP) system to study in detail: a single CMP router switch. Our goal is to design a BulletProof CMP switch architecture capable of tolerating significant levels of various types of defects. We first assess the vulnerability of the CMP switch to transient faults. To better understand the impact of these faults, we evaluate our CMP switch designs using circuit-level timing on detailed physical layouts. Our infrastructure represents a new level of fidelity in architectural-level fault analysis, as we can accurately track faults as they occur and note whether they manifest or are masked in the circuits, logic, or architecture. We find that transient faults, because of their fleeting nature, are of little concern for our CMP switch, even within large switch fabrics with fast clocks. Next, we develop a unified model of permanent faults, based on the time-tested bathtub curve. Using this abstraction, we analyze the reliability versus area tradeoff across a wide spectrum of CMP switch designs, ranging from unprotected designs to fully protected designs with on-line repair and recovery capabilities. Protection is considered at multiple levels, from the entire system down through arbitrary partitions of the design. We find that domain-specific techniques, such as end-to-end error detection, resource sparing, automatic circuit decomposition, and iterative diagnosis and reconfiguration, yield designs that tolerate a larger number of defects with less overhead than naïve triple-modular redundancy.
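
As a rough illustration of two notions the abstract relies on, the sketch below models a bathtub-shaped failure rate as the sum of a decreasing Weibull hazard (infant mortality), a constant hazard (random failures), and an increasing Weibull hazard (wear-out), and computes the textbook reliability of triple-modular redundancy with a perfect voter. Every function name and numeric parameter here is a hypothetical placeholder chosen for illustration. None of them is taken from the article, and the article's own unified fault model may differ.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (k/s) * (t/s)**(k-1), valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub curve: sum of three hazard components.

    All constants are made-up values (t in hours), not data from the article.
    """
    infant = weibull_hazard(t, shape=0.5, scale=2000.0)    # decreasing rate
    random_failures = 1e-5                                  # constant rate
    wearout = weibull_hazard(t, shape=4.0, scale=80000.0)  # increasing rate
    return infant + random_failures + wearout


def tmr_reliability(r):
    """Reliability of 2-of-3 majority voting over independent modules,
    each with reliability r, assuming a perfect voter."""
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    for hours in (100, 20000, 200000):
        print(f"h({hours}) = {bathtub_hazard(hours):.3e} failures/hour")
    for r in (0.9, 0.5, 0.4):
        print(f"module R = {r}: TMR R = {tmr_reliability(r):.3f}")
```

Under these assumed parameters the hazard is high early, lowest in mid-life, and rises again late in life. The TMR formula also shows why blanket triplication is not always a win: it improves on a single module only when the module reliability exceeds 0.5, and it does so at roughly three times the area before counting the voter.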
