NBTI aware workload balancing in multi-core systems

As device feature sizes continue to shrink, reliability becomes a severe issue due to process variation, particle-induced transient errors, and transistor wear-out/stress such as Negative Bias Temperature Instability (NBTI). Unless this problem is addressed, chip multi-processor (CMP) systems face low yields and short mean-time-to-failure (MTTF). This paper proposes a new design framework for multi-core systems that accounts for the impact of device wear-out. Based on a device fractional NBTI model, we propose a new NBTI-aware system workload model and develop a new dynamic tile partition (DTP) algorithm that balances workload among active cores while relaxing stressed ones. Experimental results on 64 cores show that by allowing a small number of cores (around 10%) to relax for a short time period (10 seconds), the proposed methodology improves CMP system yield. We use the percentage of core failures to represent the yield improvement. The new strategy reduces the number of core failures by 20% and extends MTTF by 30% with little degradation in performance (less than 6%).
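The idea of relaxing a small fraction of stressed cores while spreading work across the remaining active ones can be illustrated with a minimal sketch. This is not the DTP algorithm itself, whose details the abstract does not give. The function name `relax_and_balance`, the per-core `stress` scores, and the even split of load are all assumptions made purely for illustration.

```python
def relax_and_balance(stress, total_load, relax_fraction=0.10):
    """Relax the most-stressed cores and split the load evenly over the rest.

    stress         -- hypothetical per-core wear-out scores (higher = more stressed)
    total_load     -- total workload to distribute, in arbitrary units
    relax_fraction -- fraction of cores allowed to rest (about 10% in the abstract)
    """
    n = len(stress)
    n_relax = int(n * relax_fraction)
    if n_relax >= n:
        raise ValueError("at least one core must stay active")

    # Rank cores from most to least stressed and rest the top n_relax of them.
    order = sorted(range(n), key=lambda i: stress[i], reverse=True)
    relaxed = set(order[:n_relax])

    # Assumption: an even split among active cores stands in for real balancing.
    share = total_load / (n - n_relax)
    load = [0.0 if i in relaxed else share for i in range(n)]
    return relaxed, load


if __name__ == "__main__":
    # 64 cores with made-up stress scores; core i has stress i.
    relaxed, load = relax_and_balance([float(i) for i in range(64)], total_load=64.0)
    print(sorted(relaxed))
    print(len([x for x in load if x > 0.0]))
```

With 64 cores and a 10% relax fraction, six cores rest and the other 58 share the load. A real scheduler would re-evaluate stress periodically (the abstract mentions a 10-second period) and would account for task placement, which this sketch omits.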
