Caspar: Hardware patching for multicore processors

Ensuring that complex multi-core processor systems execute correctly once deployed in the field remains an extremely challenging task. Most of this effort is concentrated on design verification, where pre- and post-silicon techniques are used to guarantee that devices behave exactly as stated in the specification. Unfortunately, even state-of-the-art validation tools cannot keep pace with the growing complexity of multi-core designs. As a result, subtle bugs still slip into released components, causing incorrect computational results or even compromising the security of end-user systems. In this work we present Caspar, an approach for in-the-field patching of the memory subsystem hardware in multi-core chips. Caspar relies on a checkpointing system, which periodically logs the state of the chip, and a novel error detection and recovery scheme, which uses a simplified mode of operation to bypass cache coherence and consistency errors. The implementation of Caspar employs hardware detectors: on-die programmable circuits that identify system configurations that may lead to bugs and trigger recovery and bypass. Our experimental results show that Caspar can effectively detect and bypass a variety of memory subsystem bugs, with as little as 2% performance impact and 6% area overhead during bug-free operation.
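The detect, roll back, and bypass flow described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not Caspar's hardware design: the toy state (`x`, `owner`), the modelled coherence bug, the function names `apply_op` and `run_with_patching`, and the checkpoint interval are all hypothetical and chosen for illustration. The model also rolls back conservatively on every detector match, even though the detector here fires before the buggy operation executes.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, int]
Op = Tuple[int, int]  # (core id, value to add to shared variable x)
Detector = Callable[[State, Op], bool]


def apply_op(state: State, op: Op, safe_mode: bool) -> None:
    """Apply one write. In fast mode a hypothetical bug drops the update
    when the writing core does not already own the line."""
    core, value = op
    if safe_mode or state["owner"] == core:
        state["x"] += value
    # Ownership moves to the writer in both modes; only the data is lost.
    state["owner"] = core


def run_with_patching(ops: List[Op], state: State,
                      detectors: List[Detector],
                      interval: int = 4) -> Tuple[State, List[str]]:
    """Run ops with periodic checkpoints; on a detector match, roll back
    to the last checkpoint and replay in the simplified (safe) mode."""
    checkpoint, checkpoint_index = dict(state), 0
    events: List[str] = []
    i = 0
    while i < len(ops):
        if i - checkpoint_index >= interval:
            checkpoint, checkpoint_index = dict(state), i
            events.append(f"checkpoint@{i}")
        if any(d(state, ops[i]) for d in detectors):
            # Recovery: restore the checkpoint, then replay up to and
            # including the flagged op in safe mode to bypass the bug.
            state.clear()
            state.update(checkpoint)
            events.append(f"rollback@{i}->{checkpoint_index}")
            for j in range(checkpoint_index, i + 1):
                apply_op(state, ops[j], safe_mode=True)
            checkpoint, checkpoint_index = dict(state), i + 1
            events.append(f"checkpoint@{i + 1}")
        else:
            apply_op(state, ops[i], safe_mode=False)
        i += 1
    return state, events


# Hypothetical detector: flags a write by a core that is not the owner.
ownership_change = lambda state, op: op[0] != state["owner"]

ops = [(0, 1), (0, 2), (1, 10), (1, 5), (0, 3)]
unpatched, _ = run_with_patching(ops, {"x": 0, "owner": 0}, detectors=[])
patched, events = run_with_patching(ops, {"x": 0, "owner": 0},
                                    detectors=[ownership_change])
```

With no detectors programmed, the modelled bug silently drops two updates; with the detector in place, the run falls back to safe mode around each risky configuration and reaches the correct sum. The cost model is also visible here: bug-free stretches run entirely in fast mode, and only flagged regions pay for rollback and replay.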
