LReplay: a pending period based deterministic replay scheme

Debugging parallel program is a well-known difficult problem. A promising method to facilitate debugging parallel program is using hardware support to achieve deterministic replay. A hardware-assisted deterministic replay scheme should have a small log size, as well as low design cost, to be feasible for adopting by industrial processors. To achieve the goals, we propose a novel and succinct hardware-assisted deterministic replay scheme named LReplay. The key innovation of LReplay is that instead of recording the logical time orders between instructions or instruction blocks as previous investigations, LReplay is built upon recording the pending period information [6]. According to the experimental results on Godson-3, the overall log size of LReplay is about 0.55B/K-Inst (byte per k-instruction) for sequential consistency, and 0.85B/K-Inst for Godson-3 consistency. The log size is smaller in an order of magnitude than state-of-art deterministic replay schemes incuring no performance loss. Furthermore, LReplay only consumes about $1.3%$ area of Godson-3, since it requires only trivial modifications to the existing components of Godson-3. The above features of LReplay demonstrate the potential of integrating hardware-assisted deterministic replay into future industrial processors.

[1]  Josep Torrellas,et al.  Capo: a software-hardware interface for practical deterministic multiprocessor replay , 2009, ASPLOS.

[2]  Min Xu,et al.  A "flight data recorder" for enabling full-system multiprocessor deterministic replay , 2003, ISCA '03.

[3]  Xuezheng Liu,et al.  Usenix Association 8th Usenix Symposium on Operating Systems Design and Implementation R2: an Application-level Kernel for Record and Replay , 2022 .

[4]  Wenguang Chen,et al.  PHANTOM: predicting performance of parallel applications on large-scale parallel machines using a single node , 2010, PPoPP '10.

[5]  Phillip B. Gibbons,et al.  On testing cache-coherent shared memories , 1994, SPAA '94.

[6]  Doug Josephson,et al.  The good, the bad, and the ugly of silicon debug , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[7]  Chih-Tsun Huang,et al.  An embedded infrastructure of debug and trace interface for the DSP platform , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[8]  Wenguang Chen,et al.  MPIWiz: subgroup reproducible replay of mpi applications , 2009, PPoPP '09.

[9]  Arvind,et al.  Memory Model = Instruction Reordering + Store Atomicity , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[10]  Tianshi Chen,et al.  Fast complete memory consistency verification , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[11]  Derek Hower,et al.  Rerun: Exploiting Episodes for Lightweight Memory Race Recording , 2008, 2008 International Symposium on Computer Architecture.

[12]  Todd J. Foster,et al.  First Silicon Functional Validation and Debug of Multicore Microprocessors , 2007, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[13]  Brandon Lucia,et al.  DMP: Deterministic Shared-Memory Multiprocessing , 2010, IEEE Micro.

[14]  Michel Dubois,et al.  Memory access buffering in multiprocessors , 1998, ISCA '98.

[15]  Samuel T. King,et al.  ReVirt: enabling intrusion analysis through virtual-machine logging and replay , 2002, OPSR.

[16]  Ing-Jer Huang,et al.  An Embedded Multi-resolution AMBA Trace Analyzer for Microprocessor-based SoC Integration , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[17]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, ISCA '90.

[18]  Jian Wang,et al.  Godson-3: A Scalable Multicore RISC Processor with x86 Emulation , 2009, IEEE Micro.

[19]  Anoop Gupta,et al.  Memory consistency and event ordering in scalable shared-memory multiprocessors , 1990, [1990] Proceedings. The 17th Annual International Symposium on Computer Architecture.

[20]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, 2008 International Symposium on Computer Architecture.

[21]  Josep Torrellas,et al.  DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Ef?ciently , 2008, International Symposium on Computer Architecture.

[22]  Min Xu,et al.  A regulated transitive reduction (RTR) for longer memory race recording , 2006, ASPLOS XII.

[23]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[24]  Robert H. B. Netzer Optimal tracing and replay for debugging shared-memory parallel programs , 1993, PADD '93.

[25]  Ion Stoica,et al.  ODR: output-deterministic replay for multicore debugging , 2009, SOSP '09.

[26]  Kees G. W. Goossens,et al.  You can catch more bugs with transaction level honey , 2008, CODES+ISSS '08.

[27]  Michel Dubois,et al.  Correct memory operation of cache-based multiprocessors , 1987, ISCA '87.

[28]  Seth Copen Goldstein,et al.  Hardware-assisted replay of multiprocessor programs , 1991, PADD '91.

[29]  Amitabha Roy,et al.  Fast and Generalized Polynomial Time Memory Consistency Verification , 2006, CAV.

[30]  Tianshi Chen,et al.  Global Clock, Physical Time Order and Pending Period Analysis in Multiprocessor Systems , 2009, ArXiv.

[31]  Thomas J. LeBlanc,et al.  Debugging Parallel Programs with Instant Replay , 1987, IEEE Transactions on Computers.

[32]  Satish Narayanasamy,et al.  BugNet: continuously recording program execution for deterministic replay debugging , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[33]  Leslie Lamport,et al.  How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs , 2016, IEEE Transactions on Computers.

[34]  James R. Goodman,et al.  Cache Consistency and Sequential Consistency , 1991 .

[35]  Satish Narayanasamy,et al.  Recording shared memory dependencies using strata , 2006, ASPLOS XII.

[36]  Pia Sanda,et al.  The attack of the "Holey Shmoos": a case study of advanced DFD and picosecond imaging circuit analysis (PICA) , 1999, International Test Conference 1999. Proceedings (IEEE Cat. No.99CH37034).

[37]  Barton P. Miller,et al.  On the Complexity of Event Ordering for Shared-Memory Parallel Program Executions , 1990, ICPP.

[38]  Leslie Lamport,et al.  Time, clocks, and the ordering of events in a distributed system , 1978, CACM.

[39]  Wi N Dows FLIGHT DATA RECORDER FOR , 2007 .