PERFECTORY: A Fault-Tolerant Directory Memory Architecture

The number of CPUs in chip multiprocessors is growing at the rate of Moore's Law, driven by continued technology advances. However, new technologies pose serious reliability challenges, such as more frequent occurrences of degraded or even nonoperational devices, which threaten the cost-effectiveness and dependability of future computing systems. In a chip multiprocessor, cache coherence mechanisms such as directory memory are critical for offering a consistent view of data to all CPUs. This work studies how to protect the on-chip coherence directory from faults. We propose a novel online fault detection and correction scheme that enhances yield and resilience to runtime errors at a small performance cost. The proposed scheme uses smart encoding and coherence protocol adaptation strategies to salvage faulty directory entries. We also develop an online error recovery scheme that protects the directory memory from soft errors. We call our fault-tolerant directory memory architecture PERFECTORY. Evaluation results show that PERFECTORY achieves very high fault resilience: over 99 percent chip yield at a 0.05 percent hard error ratio and an MTTF of 1,934 years at 1,000 FIT using a 100-processor cluster configuration. PERFECTORY limits performance degradation to less than 1 percent at a 0.05 percent hard error ratio and requires significantly smaller area overheads than existing redundancy approaches.
