Cores that don't count

We are accustomed to thinking of computers as fail-stop, especially the cores that execute instructions, and most system software implicitly relies on that assumption. During most of the VLSI era, processors that passed manufacturing tests and were operated within specifications have insulated us from this fiction. As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated with specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often "silent": the only symptom is an erroneous computation. We refer to a core that develops such behavior as "mercurial." Mercurial cores are extremely rare, but in a large fleet of servers we observe the disruption they cause often enough to see them as a distinct problem, one that will require collaboration between hardware designers, processor vendors, and systems software architects. This paper is a call to action for a new focus in systems research; we speculate about several software-based approaches to mercurial cores, ranging from better detection and isolation mechanisms to methods for tolerating the silent data corruption they cause.
