ROSY: recovering processor and memory systems from hard errors

In the nanometer era, there has been a steady decline in the semiconductor chip manufacturing yield due to various contributing factors, such as wearout and defects due to complex processes. One of the strategies to alleviate this issue is to recover and use faulty hardware at gracefully degraded performance. A common, though naive, recovery strategy followed in the context of general purpose multicore systems is to disable the cores with faults and use only the fully functional cores. Such a coarse-granular solution is suboptimal, as the disabled cores would have many working modules which go un-utilized. The Resurrecting Operating SYstem (ROSY) presented in this paper is a step towards the development of an operating system that can work on faulty cores by adapting itself to hardware faults using software workarounds, and and utilize their working components. We consider many realistic fault models and present software workarounds for them. We have developed a framework which can be trivially plugged into a fullyfeatured x86 based OS kernel to demonstrate the feasibility of the proposed ideas. Performance evaluation using SPEC benchmarks and real-world applications show that the performance degradation of the depleted cores executing ROSY is on an average between 1.6x to 4x, depending on the fault type.

[1]  Babak Falsafi,et al.  Detecting Emerging Wearout Faults , 2007 .

[2]  Matt T. Yourst PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator , 2007, 2007 IEEE International Symposium on Performance Analysis of Systems & Software.

[3]  Sarita V. Adve,et al.  Understanding the propagation of hard errors to software and implications for resilient system design , 2008, ASPLOS.

[4]  Onur Mutlu,et al.  Software-Based Online Detection of Hardware Defects Mechanisms, Architectural Support, and Evaluation , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[5]  Doug Burger,et al.  Exploiting microarchitectural redundancy for defect tolerance , 2003, Proceedings 21st International Conference on Computer Design.

[6]  Sule Ozev,et al.  Tolerating hard faults in microprocessor array structures , 2004, International Conference on Dependable Systems and Networks, 2004.

[7]  Albert Meixner,et al.  Detouring: Translating software to circumvent hard faults in simple cores , 2008, 2008 IEEE International Conference on Dependable Systems and Networks With FTCS and DCC (DSN).

[8]  D. Weiss,et al.  The on-chip 3 MB subarray based 3rd level cache on an Itanium microprocessor , 2002, 2002 IEEE International Solid-State Circuits Conference. Digest of Technical Papers (Cat. No.02CH37315).

[9]  Suk Joo Bae,et al.  Yield prediction via spatial modeling of clustered defect counts across a wafer map , 2007 .

[10]  Omer Khan,et al.  Thread Relocation: A Runtime Architecture for Tolerating Hard Errors in Chip Multiprocessors , 2010, IEEE Transactions on Computers.

[11]  Shantanu Gupta,et al.  Architectural core salvaging in a multi-core processor for hard-error tolerance , 2009, ISCA '09.

[12]  Pradip Bose,et al.  Exploiting structural duplication for lifetime reliability enhancement , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[13]  T. N. Vijaykumar,et al.  Rescue: a microarchitecture for testability and defect tolerance , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[14]  Sule Ozev,et al.  Online diagnosis of hard faults in microprocessors , 2007, TACO.

[15]  Yervant Zorian,et al.  2003 technology roadmap for semiconductors , 2004, Computer.

[16]  Daniel J. Sorin,et al.  Core Cannibalization Architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[17]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[18]  Robert P. Colwell The Pentium Chronicles , 2005 .

[19]  Amin Ansari,et al.  StageNet: A Reconfigurable Fabric for Constructing Dependable CMPs , 2011, IEEE Transactions on Computers.