Operating system support for redundant multithreading

In modern commodity operating systems, core functionality is usually designed on the assumption that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to this problem either rely on hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or impose additional requirements on software development, making reuse of existing software hard, if not impossible. In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain incurs a runtime overhead of at most 30% for triple-modular redundancy, and in many cases the overhead remains below 5%. Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.
