CSR: Core Surprise Removal in Commodity Operating Systems

One of the adverse effects of shrinking transistor sizes is that processors have become increasingly prone to hardware faults. At the same time, the number of cores per die rises. Consequently, core failures can no longer be ruled out, and future operating systems for many-core machines will have to incorporate fault tolerance mechanisms. We present CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems. Our approach overcomes surprise removal of faulty cores, and also tolerates cascading core failures. When a core fails in user mode, CSR terminates the process executing on that core and migrates the remaining processes in its run-queue to other cores. We further show how hardware transactional memory may be used to overcome failures in critical kernel code. Our solution is scalable, incurs low overhead, and is designed to integrate into modern operating systems. We have implemented it in the Linux kernel, using Haswell's Transactional Synchronization Extension, and tested it on a real system.

[1]  Theo Ungerer,et al.  Fault detection and tolerance mechanisms for future 1000 core systems , 2013, 2013 International Conference on High Performance Computing & Simulation (HPCS).

[2]  S. Borkar,et al.  An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.

[3]  Shlomi Dolev,et al.  Towards Self-Stabilizing Operating Systems , 2008, IEEE Transactions on Software Engineering.

[4]  Paolo Faraboschi,et al.  COTSon: infrastructure for full system simulation , 2009, OPSR.

[5]  Brian N. Bershad,et al.  Recovering device drivers , 2004, TOCS.

[6]  Torvald Riegel,et al.  Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack , 2010, EuroSys '10.

[7]  Feng Zhao,et al.  Energy aware consolidation for cloud computing , 2008, CLUSTER 2008.

[8]  Hermann Härtig,et al.  Who Watches the Watchmen? Protecting Operating System Reliability Mechanisms , 2012, HotDep.

[9]  Theo Ungerer,et al.  Impact of Message Based Fault Detectors on Applications Messages in a Network on Chip , 2013, 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing.

[10]  Maurice Herlihy,et al.  Transactional Memory: Architectural Support For Lock-free Data Structures , 1993, Proceedings of the 20th Annual International Symposium on Computer Architecture.

[11]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[12]  Anoop Gupta,et al.  Hive: fault containment for shared-memory multiprocessors , 1995, SOSP.

[13]  Cecilia Metra,et al.  Error correcting code analysis for cache memory high reliability and performance , 2011, 2011 Design, Automation & Test in Europe.

[14]  Li Zhao,et al.  VM3: Measuring, modeling and managing VM shared resources , 2009, Comput. Networks.

[15]  Adrian Schüpbach,et al.  The multikernel: a new OS architecture for scalable multicore systems , 2009, SOSP '09.

[16]  John R. Douceur,et al.  Cycles, cells and platters: an empirical analysisof hardware failures on a million consumer PCs , 2011, EuroSys '11.

[17]  Theo Ungerer,et al.  Fault Localization in NoCs Exploiting Periodic Heartbeat Messages in a Many-Core Environment , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.

[18]  George Candea,et al.  Microreboot - A Technique for Cheap Recovery , 2004, OSDI.

[19]  Axel Jantsch,et al.  Methods for fault tolerance in networks-on-chip , 2013, CSUR.

[20]  Wilson C. Hsieh,et al.  Bigtable: A Distributed Storage System for Structured Data , 2006, TOCS.

[21]  Maged M. Michael,et al.  Evaluation of Blue Gene/Q hardware support for transactional memories , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[22]  Osman S. Unsal,et al.  FaulTM: Error detection and recovery using Hardware Transactional Memory , 2013, 2013 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[23]  George C. Necula,et al.  SafeDrive: safe and recoverable extensions using language-based techniques , 2006, OSDI '06.

[24]  Fabrice Bellard,et al.  QEMU, a Fast and Portable Dynamic Translator , 2005, USENIX Annual Technical Conference, FREENIX Track.

[25]  Wolfgang Mauerer,et al.  Professional Linux Kernel Architecture , 2008 .

[26]  Avi Mendelson,et al.  Architectural Support for Fault Tolerance in a Teradevice Dataflow System , 2014, International Journal of Parallel Programming.

[27]  Brian N. Bershad,et al.  Improving the reliability of commodity operating systems , 2005, TOCS.

[28]  Paul E. McKenney,et al.  Cleaning up Linux's CPU hotplug for real time and energy management , 2012, SIGBED.

[29]  Avi Mendelson,et al.  TERAFLUX: Harnessing dataflow in next generation teradevices , 2014, Microprocess. Microsystems.

[30]  Robert Morris,et al.  Optimizing MapReduce for Multicore Architectures , 2010 .

[31]  Herbert Bos,et al.  MINIX 3: a highly reliable, self-repairing operating system , 2006, OPSR.

[32]  Michael M. Swift,et al.  Chameleon: operating system support for dynamic processors , 2012, ASPLOS XVII.

[33]  Guu-Chang Yang Reliability of semiconductor RAMs with soft-error scrubbing techniques , 1995 .

[34]  Richard D. Schlichting,et al.  Fail-stop processors: an approach to designing fault-tolerant computing systems , 1983, TOCS.

[35]  Samuel T. King,et al.  Recovery domains: an organizing principle for recoverable operating systems , 2009, ASPLOS.

[36]  Jie Liu,et al.  Algorithm Design for Performance Aware VM Consolidation , 2013 .

[37]  Ravi Rajwar,et al.  Speculative lock elision: enabling highly concurrent multithreaded execution , 2001, Proceedings. 34th ACM/IEEE International Symposium on Microarchitecture. MICRO-34.

[38]  Andrzej Kochut,et al.  Dynamic Placement of Virtual Machines for Managing SLA Violations , 2007, 2007 10th IFIP/IEEE International Symposium on Integrated Network Management.

[39]  John L. Henning SPEC CPU2006 benchmark descriptions , 2006, CARN.

[40]  Anand Sivasubramaniam,et al.  Fault-aware job scheduling for BlueGene/L systems , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..

[41]  Jeffrey Katcher,et al.  PostMark: A New File System Benchmark , 1997 .

[42]  Howard Gobioff,et al.  The Google file system , 2003, SOSP '03.

[43]  Sarita V. Adve,et al.  The impact of technology scaling on lifetime reliability , 2004, International Conference on Dependable Systems and Networks, 2004.

[44]  Christopher J. Hughes,et al.  Performance evaluation of Intel® Transactional Synchronization Extensions for high-performance computing , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[45]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[46]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[47]  Shekhar Y. Borkar,et al.  Designing reliable systems from unreliable components: the challenges of transistor variability and degradation , 2005, IEEE Micro.

[48]  Calton Pu,et al.  An Analysis of Performance Interference Effects in Virtual Environments , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.

[49]  Robert Tappan Morris,et al.  An Analysis of Linux Scalability to Many Cores , 2010, OSDI.

[50]  Timothy Roscoe,et al.  Decoupling Cores, Kernels, and Operating Systems , 2014, OSDI.

[51]  Donald E. Porter,et al.  TxLinux: using and managing hardware transactional memory in an operating system , 2007, SOSP.

[52]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[53]  Gabriel Parmer,et al.  Predictable, Efficient System-Level Fault Tolerance in C^3 , 2013, 2013 IEEE 34th Real-Time Systems Symposium.

[54]  Yiran Chen,et al.  The salvage cache: A fault-tolerant cache architecture for next-generation memory technologies , 2009, 2009 IEEE International Conference on Computer Design.

[55]  Gernot Heiser Many-core chips — a case for virtual shared memory , 2009 .

[56]  Bran Selic,et al.  A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems , 2013, The Journal of Supercomputing.

[57]  Kevin Klues,et al.  Improving per-node efficiency in the datacenter with new OS abstractions , 2011, SoCC.