Leveraging 3D Technology for Improved Reliability

Aggressive technology scaling over the years has helped improve processor performance but has caused a reduction in processor reliability. Shrinking transistor sizes and lower supply voltages have increased the vulnerability of computer systems towards transient faults. An increase in within-die and die-to-die parameter variations has also led to a greater number of dynamic timing errors. A potential solution to mitigate the impact of such errors is redundancy via an in-order checker processor. Emerging 3D chip technology promises increased processor performance as well as reduced power consumption because of shorter on-chip wires. In this paper, we leverage the "snap-on" functionality provided by 3D integration and propose implementing the redundant checker processor on a second die. This allows manufacturers to easily create a family of "reliable processors" without significantly impacting the cost or performance for customers that care less about reliability. We comprehensively evaluate design choices for this second die, including the effects of L2 cache organization, deep pipelining, and frequency. An interesting feature made possible by 3D integration is the incorporation of heterogeneous process technologies within a single chip. We evaluate the possibility of providing redundancy with an older process technology, an unexplored and especially compelling application of die heterogeneity. We show that with the most pessimistic assumptions, the overhead of the second die can be as high as either a 7degC temperature increase or a 8% performance loss. However, with the use of an older process, this overhead can be reduced to a 3degC temperature increase or a 4% performance loss, while also providing higher error resilience.

[1]  William J. Dally,et al.  Principles and Practices of Interconnection Networks , 2004 .

[2]  Todd M. Austin,et al.  The SimpleScalar tool set, version 2.0 , 1997, CARN.

[3]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[4]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[5]  Bernardo A. Huberman,et al.  Predicting the Future , 2003, Inf. Syst. Frontiers.

[6]  C. Nicopoulos,et al.  Design and Management of 3D Chip Multiprocessors Using Network-in-Memory , 2006, ISCA 2006.

[7]  Kieran Mansley,et al.  Engineering a user-level TCP for the CLAN network , 2003, NICELI '03.

[8]  Brad Calder,et al.  Automatically characterizing large scale program behavior , 2002, ASPLOS X.

[9]  Babak Falsafi,et al.  Dual use of superscalar datapath for transient-fault detection and recovery , 2001, MICRO.

[10]  Jason Cong,et al.  An automated design flow for 3D microarchitecture evaluation , 2006, Asia and South Pacific Conference on Design Automation, 2006..

[11]  Sharad Malik,et al.  Orion: a power-performance simulator for interconnection networks , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[12]  Nisha Checka,et al.  Technology, performance, and computer-aided design of three-dimensional integrated circuits , 2004, ISPD '04.

[13]  Jeffrey G. Gray,et al.  Efficient direct user level sockets for an Intel/spl reg/ Xeon/spl trade/ processor based TCP on-load engine , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[14]  Martin Burtscher,et al.  Bridging the processor-memory performance gap with 3D IC technology , 2005, IEEE Design & Test of Computers.

[15]  Rajeev Balasubramonian,et al.  Interconnect design considerations for large NUCA caches , 2007, ISCA '07.

[16]  Kaustav Banerjee,et al.  Introspective 3D chips , 2006, ASPLOS XII.

[17]  John Nagle,et al.  Congestion control in IP/TCP internetworks , 1984, CCRV.

[18]  Srihari Makineni,et al.  Exploring the cache design space for large scale CMPs , 2005, CARN.

[19]  Rohit Bhatia,et al.  Montecito: a dual-core, dual-thread Itanium processor , 2005, IEEE Micro.

[20]  P. Reed,et al.  Design aspects of a microprocessor data cache using 3D die interconnect technology , 2005, 2005 International Conference on Integrated Circuit Design and Technology, 2005. ICICDT 2005..

[21]  Rajeev Balasubramonian,et al.  Power Efficient Approaches to Redundant Multithreading , 2007, IEEE Transactions on Parallel and Distributed Systems.

[22]  Karthik Ramani,et al.  Interconnect-Aware Coherence Protocols for Chip Multiprocessors , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[23]  Irith Pomeranz,et al.  Transient-Fault Recovery for Chip Multiprocessors , 2003, IEEE Micro.

[24]  Eric Rotenberg,et al.  AR-SMT: a microarchitectural approach to fault tolerance in microprocessors , 1999, Digest of Papers. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing (Cat. No.99CB36352).

[25]  Sanjay J. Patel,et al.  Characterizing the effects of transient faults on a high-performance processor pipeline , 2004, International Conference on Dependable Systems and Networks, 2004.

[26]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[27]  Kaustav Banerjee,et al.  A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[28]  Gabriel H. Loh,et al.  Implementing caches in a 3D technology for high performance processors , 2005, 2005 International Conference on Computer Design.

[29]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[30]  MuralimanoharNaveen,et al.  Interconnect design considerations for large NUCA caches , 2007 .

[31]  Norman P. Jouppi,et al.  Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[32]  Yuan Xie,et al.  Design space exploration for 3D architectures , 2006, JETC.

[33]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[34]  Michael L. Scott,et al.  Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling , 2002, Proceedings Eighth International Symposium on High Performance Computer Architecture.

[35]  B. Narasimham,et al.  Radiation-Induced Soft Error Rates of Advanced CMOS Bulk Devices , 2006, 2006 IEEE International Reliability Physics Symposium Proceedings.

[36]  Pradip Bose,et al.  Optimizing pipelines for power and performance , 2002, 35th Annual IEEE/ACM International Symposium on Microarchitecture, 2002. (MICRO-35). Proceedings..

[37]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[38]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[39]  Lei Jiang,et al.  Die Stacking (3D) Microarchitecture , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[40]  Patrick James,et al.  Predicting the Future of the FTAA , 2000 .

[41]  Gabriel H. Loh,et al.  Thermal analysis of a 3D die-stacked high-performance microprocessor , 2006, GLSVLSI '06.

[42]  Greg J. Regnier,et al.  TCP onloading for data center servers , 2004, Computer.