Operating SECDED-based caches at ultra-low voltage with FLAIR

Voltage scaling is often limited by bit failures in large on-chip caches. Prior approaches for enabling cache operation at low voltages rely on correcting cache lines with multi-bit failures. Unfortunately, multi-bit Error Correcting Codes (ECC) incur significant storage overhead and require complex logic. Our goal is to develop solutions that enable ultra-low voltage operation while requiring minimal changes to existing SECDED-based cache designs. We exploit the observation that only a small percentage of cache lines have multi-bit failures. We propose FLexible And Introspective Replication (FLAIR), which performs two-way replication for part of the cache during testing to maintain robustness, and disables lines with multi-bit failures after testing. FLAIR leverages the correction capability of the existing SECDED code to greatly improve on simple two-way replication. FLAIR provides a Vmin of 485 mV (similar to ECC-8) and maintains robustness to soft errors, while incurring a storage overhead of only one bit per cache line.
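The observation that few lines have multi-bit failures can be illustrated with a simple binomial estimate. The following is a minimal sketch, not the paper's fault model: it assumes bit failures are independent, a hypothetical line size of 523 bits (512 data bits plus 11 SECDED check bits), and illustrative per-bit failure probabilities that are not taken from the abstract.

```python
def prob_multi_bit_failure(bits_per_line: int, p_bit: float) -> float:
    """Probability that a line contains two or more faulty bits.

    Assumes each bit fails independently with probability p_bit
    (a simplifying assumption made for this illustration).
    """
    p_zero = (1.0 - p_bit) ** bits_per_line
    p_one = bits_per_line * p_bit * (1.0 - p_bit) ** (bits_per_line - 1)
    return 1.0 - p_zero - p_one


# Hypothetical line size: 512 data bits + 11 SECDED check bits.
LINE_BITS = 523

if __name__ == "__main__":
    # Illustrative per-bit failure probabilities, not measured values.
    for p_bit in (1e-5, 1e-4, 1e-3):
        frac = prob_multi_bit_failure(LINE_BITS, p_bit)
        print(f"p_bit={p_bit:g}: fraction of lines with >=2 faulty bits = {frac:.6f}")
```

Under these assumptions, lines with zero or one faulty bit remain usable with SECDED alone, and only the fraction computed above would need to be disabled.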
