REMEDIATE: A scalable fault-tolerant architecture for low-power NUCA cache in tiled CMPs

Technology scaling and process variation severely degrade the reliability of Chip Multiprocessors (CMPs), especially their large cache blocks. To improve cache reliability, we propose REMEDIATE, a scalable fault-tolerant architecture for low-power design of shared Non-Uniform Cache Access (NUCA) cache in Tiled CMPs. REMEDIATE achieves fault-tolerance through redundancy from multiple banks to maximize the amount of fault remapping, and minimize the amount of capacity lost in the cache when the failure rate is high. REMEDIATE leverages a scalable fault protection technique using two different remapping heuristics in a distributed shared cache architecture with non-uniform latencies. We deploy a graph coloring algorithm to optimize REMEDIATE's remapping configuration. We perform an extensive design space exploration of operating voltage, performance, and power that enables designers to select different operating points and evaluate their design efficacy. Experimental results on a 4×4 tiled CMP system voltage scaled to below 400mV show that REMEDIATE saves up to 50% power while recovering more than 80% of the faulty cache area with only modest performance degradation.

[1]  Amin Ansari,et al.  Enabling ultra low voltage system operation by tolerating on-chip cache failures , 2009, ISLPED.

[2]  Yiran Chen,et al.  Tolerating process variations in large, set-associative caches: The buddy cache , 2009, TACO.

[3]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[4]  Dean M. Tullsen,et al.  Fellowship - Simulation And Modeling Of A Simultaneous Multithreading Processor , 1996, Int. CMG Conference.

[5]  Alaa R. Alameldeen,et al.  Trading off Cache Capacity for Reliability to Enable Low Voltage Operation , 2008, 2008 International Symposium on Computer Architecture.

[6]  Avesta Sasan,et al.  History & Variation Trained Cache (HVT-Cache): A process variation aware and fine grain voltage scalable cache with active access history monitoring , 2012, Thirteenth International Symposium on Quality Electronic Design (ISQED).

[7]  Wei Wu,et al.  Improving cache lifetime reliability at ultra-low voltages , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[8]  Avesta Sasan,et al.  Variation Trained Drowsy Cache (VTD-Cache): A History Trained Variation Aware Drowsy Cache for Fine Grain Voltage Scaling , 2012, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[9]  Amin Ansari,et al.  Archipelago: A polymorphic cache design for enabling robust near-threshold operation , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[10]  David A. Wood,et al.  Managing Wire Delay in Large Chip-Multiprocessor Caches , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[11]  Nikil D. Dutt,et al.  A novel NoC-based design for fault-tolerance of last-level caches in CMPs , 2012, CODES+ISSS '12.

[12]  Ram Huggahalli,et al.  Impact of Cache Coherence Protocols on the Processing of Network Traffic , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[13]  Doug Burger,et al.  An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches , 2002, ASPLOS X.

[14]  Lucas Rioux-Maldague Graph Coloring Algorithms , 2014 .

[15]  Nikil D. Dutt,et al.  FFT-Cache: A Flexible Fault-Tolerant Cache architecture for ultra low voltage operation , 2011, 2011 Proceedings of the 14th International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES).

[16]  Avesta Sasan,et al.  Process Variation Aware SRAM/Cache for aggressive voltage-frequency scaling , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[17]  Avesta Sasan,et al.  Inquisitive Defect Cache: A Means of Combating Manufacturing Induced Process Variation , 2011, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[18]  Trevor N. Mudge,et al.  On-Chip Cache Device Scaling Limits and Effective Fault Repair Techniques in Future Nanoscale Technology , 2007, 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools (DSD 2007).

[19]  Babak Falsafi,et al.  Multi-bit Error Tolerant Caches Using Two-Dimensional Error Coding , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[20]  Yiran Chen,et al.  The salvage cache: A fault-tolerant cache architecture for next-generation memory technologies , 2009, 2009 IEEE International Conference on Computer Design.

[21]  Avesta Sasan,et al.  A fault tolerant cache architecture for sub 500mV operation: resizable data composer cache (RDC-cache) , 2009, CASES '09.

[22]  Nikil D. Dutt,et al.  E < MC2: less energy through multi-copy cache , 2010, CASES '10.

[23]  Yu Cao,et al.  A resilience roadmap , 2010, 2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010).

[24]  Hussein Al-Omari,et al.  New Graph Coloring Algorithms , 2006 .

[25]  Sarita V. Adve,et al.  Architectures for online error detection and recovery in multicore processors , 2011, 2011 Design, Automation & Test in Europe.

[26]  Huawei Li,et al.  Address Remapping for Static NUCA in NoC-Based Degradable Chip-Multiprocessors , 2010, 2010 IEEE 16th Pacific Rim International Symposium on Dependable Computing.

[27]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[28]  Srinivas Devadas,et al.  Scalable, accurate multicore simulation in the 1000-core era , 2011, (IEEE ISPASS) IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE.

[29]  David Blaauw,et al.  Yield-Driven Near-Threshold SRAM Design , 2010, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[30]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, MICRO.

[31]  A. Chandrakasan,et al.  A 256kb Sub-threshold SRAM in 65nm CMOS , 2006, 2006 IEEE International Solid State Circuits Conference - Digest of Technical Papers.