Invited: Cross-layer modeling and optimization for electromigration induced reliability

In this paper, we propose a new approach for cross-layer electromigration (EM) induced reliability modeling and optimization at the physics, system, and datacenter levels. We use a recently proposed physics-based EM reliability model to predict the long-term EM failures of full-chip power grid networks. We show how this physics-based dynamic EM model can be abstracted from the physics level to the system level and further to the datacenter level. Our datacenter system-level power model is based on the BigHouse simulator. To speed up online energy optimization in a datacenter, we propose a new combined datacenter power and reliability compact model using a learning-based approach, in which a feed-forward neural network (FNN) is trained to predict the energy and long-term reliability of each processor under given datacenter scheduling policies and workloads. To optimize the energy and reliability of a datacenter, we apply an efficient adaptive Q-learning based reinforcement learning method. Experimental results show that the proposed compact models, trained with different workloads under different cluster power modes and scheduling policies, provide accurate energy and lifetime estimates. Moreover, the proposed optimization method effectively manages and optimizes datacenter energy subject to reliability constraints under a given power budget and performance requirement.
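To make the reinforcement learning step more concrete, the following is a minimal sketch of plain tabular Q-learning applied to a toy power-mode selection problem. It is not the adaptive Q-learning method of the paper. The state (a discrete load level), the action (a discrete power mode), the cost terms in `toy_reward`, and all numeric constants are hypothetical and chosen only for illustration. In the approach described above, the reward would instead be derived from the energy and lifetime predicted by the trained FNN compact model.

```python
import random


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning update and return the new value."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


def toy_reward(load, mode):
    """Hypothetical reward: negative sum of three illustrative cost terms."""
    energy = 1.0 + mode                         # higher power mode costs more energy
    perf_penalty = 5.0 if mode < load else 0.0  # mode too low for the offered load
    wear_penalty = 0.5 * mode                   # stand-in for EM-induced wear
    return -(energy + perf_penalty + wear_penalty)


def train(steps=2000, n_loads=3, n_modes=3, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning on the toy problem and return the Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * n_modes for _ in range(n_loads)]
    load = rng.randrange(n_loads)
    for _ in range(steps):
        if rng.random() < epsilon:
            mode = rng.randrange(n_modes)  # explore
        else:
            mode = max(range(n_modes), key=lambda a: q[load][a])  # exploit
        # Assumption: the next load is drawn independently of the chosen mode.
        next_load = rng.randrange(n_loads)
        q_update(q, load, mode, toy_reward(load, mode), next_load)
        load = next_load
    return q


if __name__ == "__main__":
    for load, row in enumerate(train()):
        print(load, [round(v, 2) for v in row])
```

Each row of the returned table holds the learned action values for one load level, and a scheduler would pick the mode with the largest value in the current row. Swapping `toy_reward` for a model-based estimate is the natural extension point in this sketch.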
