LDet: Determinizing Asynchronous Transfer for Postsilicon Debugging

To debug silicon bugs efficiently and effectively, a promising solution is to determinize the chip so that buggy silicon behaviors can be faithfully reproduced on an RTL simulator. In this paper, we propose a novel scheme, named LDet, that determinizes a chip by removing the nondeterminism in transfers that cross different clock domains, even when these clock domains are heterochronous. The key insight of LDet is that the clock frequencies can be slightly adjusted at runtime so that the actual frequency ratio between two clocks always approaches a rational constant with bounded accumulated error. With this technique, called dynamic frequency adjusting, the processing time of each asynchronous transfer can be determinized by a deterministic asynchronous FIFO (DAF). As a consequence, the behavior of the whole chip is deterministic, and the chip behavior can be reproduced on the RTL simulator given the same initial state and input sequence. We implement LDet on the RTL design of a processor chip with many clock domains. Experiments show that, on average, LDet adds only about one cycle of latency to each asynchronous transfer. As a result, LDet incurs a negligible performance overhead of about 0.7 percent slowdown. Moreover, LDet increases the chip area by less than 0.2 percent. These low performance and area overheads demonstrate that LDet is applicable in industry.
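The two mechanisms named in the abstract can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not LDet's actual circuit: it assumes a continuous-phase model of two clocks, a hypothetical bang-bang controller that switches clock B between settings 1 percent above and below its nominal frequency, and a hypothetical delivery rule standing in for the DAF. The function names `max_phase_error` and `delivery_cycle`, and all numbers (a 1000 MHz clock A, a 701.3 MHz clock B, a target ratio of 7/10), are illustrative assumptions.

```python
import math
from fractions import Fraction


def max_phase_error(f_a, f_nom, ratio, delta, dt, steps, adjust=True):
    """Largest |phi_b - ratio * phi_a| (in cycles) over `steps` control intervals.

    phi_a and phi_b count the cycles clocks A and B have completed since t = 0.
    Frequencies are in MHz and dt in microseconds, so f * dt is in cycles.
    Exact rational arithmetic keeps the model itself reproducible.
    """
    f_low = f_nom * (1 - delta)   # hypothetical setting: clock B slightly slowed down
    f_high = f_nom * (1 + delta)  # hypothetical setting: clock B slightly sped up
    assert f_low < ratio * f_a < f_high, "target must lie between the two settings"
    phi_a = phi_b = Fraction(0)
    worst = Fraction(0)
    for _ in range(steps):
        if adjust:
            # Bang-bang rule: B ahead of the rational target -> slow it; else speed it up.
            f_b = f_low if phi_b - ratio * phi_a > 0 else f_high
        else:
            f_b = f_nom  # free-running clock B: the error drifts linearly
        phi_a += f_a * dt
        phi_b += f_b * dt
        worst = max(worst, abs(phi_b - ratio * phi_a))
    return worst


def delivery_cycle(send_cycle, ratio, slack):
    """Hypothetical DAF rule: an item written at sender cycle `send_cycle`
    becomes visible to the receiver at receiver cycle
    ceil(send_cycle * ratio) + slack. With a rational ratio this depends only
    on send_cycle, never on analog timing, provided `slack` (an assumed
    constant) covers the bounded phase error plus synchronizer latency."""
    return math.ceil(send_cycle * ratio) + slack


if __name__ == "__main__":
    args = dict(f_a=Fraction(1000), f_nom=Fraction(7013, 10), ratio=Fraction(7, 10),
                delta=Fraction(1, 100), dt=Fraction(1, 10), steps=10_000)
    print(float(max_phase_error(**args, adjust=False)))  # 1300.0: drift grows without bound
    print(float(max_phase_error(**args, adjust=True)))   # 0.8313: stays below one cycle
    print([delivery_cycle(n, Fraction(7, 10), slack=2) for n in range(1, 6)])  # [3, 4, 5, 5, 6]
```

In this model the free-running clock drifts by 0.13 cycles per control interval, whereas the adjusted clock's error never exceeds the larger per-interval step of the two settings (0.8313 cycles here). That bounded error is what lets a fixed `slack` make every delivery cycle a pure function of the sender cycle.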
