QED: Quick Error Detection tests for effective post-silicon validation

Long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as a system-level failure, is a major challenge in post-silicon validation of robust systems. In this paper, we present a new technique called Quick Error Detection (QED), which transforms existing post-silicon validation tests into new validation tests that significantly reduce error detection latency. QED transformations allow flexible tradeoffs between error detection latency, coverage, and complexity, and can be implemented in software with little or no hardware change. Results from hardware experiments on quad-core Intel® Core™ i7 platforms and from simulations of a multi-core MIPS processor design demonstrate that:

1. QED improves error detection latency by six orders of magnitude, i.e., from billions of cycles to a few thousand cycles or less.
2. QED transformations do not degrade the coverage of validation tests, as estimated empirically by measuring the maximum operating frequency over a wide range of operating voltage points.
3. QED tests improve coverage by detecting errors that escape the original non-QED tests.
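To make the notion of error detection latency concrete, the following is a minimal Python sketch, not the transformation used in the paper. It assumes a toy accumulate-style test, a hypothetical injected bit flip at `bug_step`, and a simple duplicate-and-compare check every `check_interval` steps. The function name `run_test` and all of its parameters are illustrative assumptions.

```python
def run_test(n_steps, bug_step, check_interval=None):
    """Return the detection latency (in steps) of a toy accumulate test.

    Assumptions for illustration only: a bug flips the lowest bit of the
    primary accumulator at `bug_step`; when `check_interval` is set, a
    duplicated (shadow) computation is compared against the primary one
    every `check_interval` steps. Returns None if no error is detected.
    """
    primary = 0
    shadow = 0  # duplicated computation, consulted only by the periodic check
    for step in range(1, n_steps + 1):
        primary += step
        shadow += step
        if step == bug_step:
            primary ^= 1  # hypothetical injected error
        if check_interval and step % check_interval == 0 and primary != shadow:
            return step - bug_step  # detected by the inserted check
    # Original-style test: the result is only checked at the very end.
    expected = n_steps * (n_steps + 1) // 2
    if primary != expected:
        return n_steps - bug_step
    return None


if __name__ == "__main__":
    print("end-of-test check latency:", run_test(100_000, 10))
    print("periodic check latency:   ", run_test(100_000, 10, check_interval=100))
```

In this sketch the end-of-test check observes the error only after the remaining steps have run, while the periodic check bounds the latency by `check_interval`. Shrinking the interval lowers latency at the cost of more inserted checks, which is one way to picture the latency and complexity tradeoff mentioned above.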
