IFRA: Post-silicon bug localization in processors

IFRA (Instruction Footprint Recording and Analysis) addresses an expensive step in post-silicon validation of processors: starting from a system failure, pinpointing the bug location and the instruction sequence that exposes the bug. On-chip recorders collect instruction footprints (information about the flow of instructions and what the instructions did as they passed through various design blocks) during normal operation of the processor in a post-silicon system validation setup. Upon system failure, the recorded information is scanned out and analyzed offline for bug localization. The analysis applies special self-consistency-based program analysis techniques to the footprints, together with the binary of the test program (the application executed during post-silicon validation). IFRA has two major benefits over traditional techniques for post-silicon bug localization: (1) it does not require full system-level reproduction of bugs, and (2) it does not require full system-level simulation. Simulation results on a complex superscalar processor demonstrate that IFRA is effective in accurately localizing electrical bugs with very little impact on overall chip area.
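To make the idea of offline self-consistency analysis concrete, the following is a minimal sketch and not the actual IFRA analysis. It assumes a hypothetical footprint format (`instr_id`, `stage`, `pc`, `opcode`), a hypothetical four-stage pipeline, and a test program binary reduced to a map from program counter to opcode. The check flags the first design block whose recorded footprint disagrees with what the binary says the instruction should be.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Footprint:
    # All fields are illustrative assumptions, not the recorder format of the paper.
    instr_id: int   # tag attached to an instruction when it enters the pipeline
    stage: str      # design block whose recorder captured this entry
    pc: int         # program counter observed by that block
    opcode: str     # operation the block observed for this instruction

# Hypothetical pipeline order used to walk the footprints of one instruction.
STAGES = ["fetch", "decode", "execute", "commit"]

def localize(footprints, program):
    """Return (instr_id, stage, reason) for the first inconsistency, else None.

    `program` maps pc -> opcode and stands in for the test program binary.
    Stages with no recorded footprint are skipped, since instructions still
    in flight at the time of failure will not have reached every block.
    """
    by_instr = {}
    for fp in footprints:
        by_instr.setdefault(fp.instr_id, {})[fp.stage] = fp

    for instr_id in sorted(by_instr):
        recorded = by_instr[instr_id]
        for stage in STAGES:
            fp = recorded.get(stage)
            if fp is None:
                continue
            expected = program.get(fp.pc)
            if expected != fp.opcode:
                reason = f"recorded {fp.opcode!r}, binary has {expected!r}"
                return (instr_id, stage, reason)
    return None

# Scanned-out log in which the execute block saw the wrong operation.
program = {0x100: "add", 0x104: "load"}
log = [Footprint(0, s, 0x100, "add") for s in STAGES]
log += [
    Footprint(1, "fetch", 0x104, "load"),
    Footprint(1, "decode", 0x104, "load"),
    Footprint(1, "execute", 0x104, "store"),
]
result = localize(log, program)
```

In this toy log, instruction 1 is consistent through fetch and decode, so the sketch points at the execute block as the candidate bug location. Note that it needs only the recorded footprints and the binary, with no replay of the failure and no system-level simulation.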
