Online design bug detection: RTL analysis, flexible mechanisms, and evaluation

Higher levels of resource integration and the addition of new features in modern multiprocessors put significant pressure on their verification. Although a large amount of time and resources is devoted to the verification phase of modern processors, many design bugs escape the verification process and slip into processors operating in the field. These design bugs often lead to lower-quality products, lower customer satisfaction, a diminished brand or company reputation, or even expensive product recalls. This paper proposes a flexible, low-overhead mechanism to detect the occurrence of design bugs during online operation. First, we analyze the actual design bugs found and fixed in a commercial chip multiprocessor, Sun's OpenSPARC T1, to understand the behavior and characteristics of design bugs. Our RTL analysis shows that the number of signals that must be monitored to detect design bugs is significantly larger than suggested by previous studies, which analyzed design bugs at a higher level using processor errata sheets. Second, based on the insights obtained from our analyses, we propose a programmable, distributed online design bug detection mechanism that incorporates the monitoring of bugs into the flip-flops of the design. The key contribution of our mechanism is its ability to monitor all control signals in the design rather than a set of signals selected at design time. As a result, it is very flexible: when a bug is discovered after the processor is shipped, it can be detected by monitoring the set of control signals that trigger the design bug. We develop an RTL prototype implementation of our mechanism on the OpenSPARC T1 chip multiprocessor. We found its area overhead to be 10% and its power consumption overhead to be 3.5% over the whole OpenSPARC T1 chip.
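
The idea of programming a detector after shipment, by selecting which control signals to watch and which values trigger the bug, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the control flip-flops are modeled as bits of an integer, and the names `BugSignature`, `mask`, `value`, and `detect` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class BugSignature:
    """Hypothetical model of one programmed bug trigger condition."""
    mask: int   # 1 bits select which control flip-flops are monitored
    value: int  # values the selected flip-flops must hold to trigger

    def matches(self, state: int) -> bool:
        # Compare only the monitored bits; unmonitored bits are ignored.
        return (state & self.mask) == (self.value & self.mask)


def detect(signatures: List[BugSignature],
           trace: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (cycle, signature index) pairs where a signature matches."""
    hits = []
    for cycle, state in enumerate(trace):
        for index, signature in enumerate(signatures):
            if signature.matches(state):
                hits.append((cycle, index))
    return hits


# Example: watch flip-flops 3 and 1; trigger when bit 3 is 1 and bit 1 is 0.
signature = BugSignature(mask=0b1010, value=0b1000)
trace = [0b0000, 0b1000, 0b1101, 0b1010]
hits = detect([signature], trace)
```

Because the mask and value are data rather than fixed logic, a newly discovered bug can be covered by loading a new signature, which mirrors the flexibility argument above; the actual mechanism distributes this comparison across the flip-flops of the design.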
