Explaining cache SER anomaly using DUE AVF measurement

We have discovered that processors can experience a super-linear increase in the rate of detected unrecoverable errors (DUE) when the size of the write-back L2 cache is doubled. This paper explains how an increase in the Architectural Vulnerability Factor (AVF) of the cache tags causes this super-linear increase in the DUE rate. AVF expresses the fraction of faults that become user-visible errors. Our hypothesis is that the increase in AVF is caused by a super-linear increase in the residence times of “dirty” data in the L2 cache. We used a combination of simulation and measurement to develop and test this hypothesis. Using proton beam irradiation, we measured the DUE rates of the write-back cache tags and analyzed the data to show that the hypothesis holds. Our investigation reveals two mechanisms by which dirty line residency causes super-linear increases in the AVF of the L2 cache tags. The first is a reduction in miss rates as we move to the part with the larger cache, which results in fewer evictions of data required for architecturally correct execution. The second is the occurrence of strided cache access patterns, which significantly increase the “dirty” residency times of cache lines without increasing the cache miss rate.
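As a rough illustration of the first mechanism, the Python sketch below counts how long lines stay dirty in a toy write-back cache at two sizes. It is not the simulator or the methodology used in this work. It assumes a direct-mapped cache, one access per time step, a synthetic trace, and that the fraction of line-time spent dirty is a usable stand-in for the tag AVF; the names `dirty_residency` and `make_trace` are hypothetical. Because the DUE rate scales with both the number of tag entries and their AVF, the product of the two serves here as a relative DUE proxy.

```python
def dirty_residency(trace, num_lines, line_size=64):
    """Return (miss_rate, dirty_fraction) for a toy direct-mapped write-back cache.

    trace: sequence of (byte_address, is_write), one access per time step.
    dirty_fraction is dirty line-time divided by total line-time; it is used
    here only as an assumed proxy for the tag AVF.
    """
    tags = [None] * num_lines
    dirty_since = [None] * num_lines  # step at which the line became dirty, None if clean
    misses = 0
    dirty_time = 0
    for t, (addr, is_write) in enumerate(trace):
        block = addr // line_size
        idx, tag = block % num_lines, block // num_lines
        if tags[idx] != tag:
            misses += 1
            # Evicting a dirty line writes it back and ends its dirty interval.
            if dirty_since[idx] is not None:
                dirty_time += t - dirty_since[idx]
            tags[idx] = tag
            dirty_since[idx] = None
        if is_write and dirty_since[idx] is None:
            dirty_since[idx] = t
    end = len(trace)
    # Lines still dirty at the end of the trace count up to the last step.
    for since in dirty_since:
        if since is not None:
            dirty_time += end - since
    return misses / end, dirty_time / (num_lines * end)


def make_trace(passes, line_size=64):
    """Synthetic trace: write block 0, then read blocks 4, 1, 2, 3, 5, 6, 7."""
    order = [0, 4, 1, 2, 3, 5, 6, 7]
    return [(block * line_size, block == 0)
            for _ in range(passes) for block in order]


if __name__ == "__main__":
    trace = make_trace(passes=100)
    for num_lines in (4, 8):
        miss_rate, dirty_fraction = dirty_residency(trace, num_lines)
        due_proxy = num_lines * dirty_fraction  # entries x AVF proxy
        print(num_lines, miss_rate, dirty_fraction, due_proxy)
```

In this synthetic trace, block 4 conflicts with the written block 0 in the 4-line cache, so the dirty line is evicted one step after each write. In the 8-line cache nothing conflicts, the miss rate falls from 1.0 to 0.01, and the dirty line is never evicted. The dirty fraction rises from 0.03125 to 0.125, so the relative DUE proxy grows by a factor of eight for a doubling of cache size. The numbers say nothing about real workloads; they only show how a lower miss rate can lengthen dirty residency.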
