Chip Self-Organization and Fault Tolerance in Massively Defective Multicore Arrays

We study chip self-organization and fault tolerance at the architectural level to improve the dependable continuous operation of multicore arrays built in massively defective nanotechnologies. Architectural self-organization results from combining self-diagnosis and self-disconnection mechanisms, which identify and isolate most permanently faulty or inaccessible cores and routers, with self-discovery of routes, which maintains communication in the array. In the methodology presented in this work, chip self-diagnosis proceeds in three steps of ascending complexity: interconnects are tested first, then routers through mutual test, and finally cores. The mutual testing of routers is especially important because faulty routers are disconnected by good ones with no assumption about the behavior of defective elements. Moreover, the disconnection of faulty routers is not physical (“hard”) but logical (“soft”): a good router simply stops communicating with any adjacent router diagnosed as defective. The chip requires no physical reconfiguration and no spare elements. Ultimately, the multicore array may be viewed as a black box that incorporates protection mechanisms and self-organizes, while external control reduces to a chip validation test which, in the simplest cases, amounts to counting the number of valid and accessible cores.
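
As an illustration of soft disconnection and of the final counting step, the following minimal Python sketch may help. It is not the authors' implementation. It assumes a 2D mesh with one router and one core per node, a single I/O port at node (0, 0), and diagnosis results supplied as plain sets; the name `count_accessible_cores` and all of its parameters are hypothetical. The interconnect, router, and core tests themselves are not modeled, only their outcomes.

```python
from collections import deque


def neighbors(r, c, rows, cols):
    """Yield the mesh neighbors of node (r, c) (assumed 2D mesh topology)."""
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc


def count_accessible_cores(rows, cols, bad_routers, bad_cores, bad_links,
                           port=(0, 0)):
    """Count good cores reachable from the I/O port (hypothetical helper).

    bad_routers: nodes whose router was diagnosed defective by its neighbors
    bad_cores:   nodes whose core failed its test
    bad_links:   frozenset({a, b}) pairs for interconnects that failed
    """
    if port in bad_routers:
        return 0

    def link_open(a, b):
        # Soft disconnection: a good router simply stops talking to a
        # neighbor diagnosed as defective, and never uses a failed link.
        return b not in bad_routers and frozenset((a, b)) not in bad_links

    # Breadth-first search only ever expands good routers, so nothing is
    # assumed about how a defective router behaves.
    seen = {port}
    queue = deque([port])
    while queue:
        node = queue.popleft()
        for nxt in neighbors(*node, rows, cols):
            if nxt not in seen and link_open(node, nxt):
                seen.add(nxt)
                queue.append(nxt)

    # A core counts only if its router is reachable and the core itself is good.
    return sum(1 for node in seen if node not in bad_cores)


# 3x3 mesh: the middle column of routers is defective, so only the
# three cores in column 0 remain accessible from the port.
blocked = {(0, 1), (1, 1), (2, 1)}
print(count_accessible_cores(3, 3, blocked, set(), set()))
```

In this sketch, cores behind the defective column are good but inaccessible, which is why the count covers cores that are both valid and reachable rather than all cores that pass their own test.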
