CoreRacer: A practical memory race recorder for multicore x86 TSO processors

Shared-memory multiprocessors are difficult to program because memory operations from different threads interleave non-deterministically. To address this issue, many hardware-based memory race recorders have been proposed that efficiently log an ordering of the shared-memory interleavings between threads for deterministic replay. These approaches are challenging to integrate into current processors because they change the cache subsystem or the coherence protocol, and most of them support a sequentially consistent memory model. In this paper, we describe CoreRacer, a chunk-based memory race recorder architecture for multicore x86 TSO processors. CoreRacer does not modify the cache subsystem, yet it still integrates into the x86 TSO memory model. We show that by leveraging a specific x86 feature, the invariant timestamp, CoreRacer maintains ordering among chunks without piggybacking on cache coherence messages. We provide a detailed implementation and evaluation of CoreRacer on a cycle-accurate x86 simulator. We show that its integration cost into x86 is minimal and that its overhead has a negligible effect on performance.
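As a rough illustration of how timestamps alone can order chunks, the following Python sketch merges per-core chunk logs into a single replay order. It is a simplified model under stated assumptions, not the CoreRacer hardware design: the `Chunk` record, its fields, and the `replay_order` helper are hypothetical names, and the sketch assumes each core logs its chunks in increasing timestamp order using a timestamp that is consistent across cores.

```python
import heapq
from typing import List, NamedTuple


class Chunk(NamedTuple):
    # Hypothetical log entry; field names are illustrative only.
    timestamp: int      # assumed: invariant-timestamp value when the chunk ended
    core: int           # core that executed the chunk
    instructions: int   # number of instructions in the chunk


def replay_order(per_core_logs: List[List[Chunk]]) -> List[Chunk]:
    """Merge per-core logs (each sorted by timestamp) into one global order."""
    return list(heapq.merge(*per_core_logs, key=lambda c: c.timestamp))


# Two cores, each with its own locally ordered chunk log.
logs = [
    [Chunk(100, 0, 40), Chunk(250, 0, 12)],
    [Chunk(180, 1, 33), Chunk(300, 1, 7)],
]
order = replay_order(logs)
```

Because the ordering information lives entirely in the per-core logs, nothing in this model needs to travel on coherence messages; how the real design terminates chunks and handles TSO store buffering is outside the scope of this sketch.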
