Lifetime improvement through runtime wear-based task mapping

As transistors continue to shrink, they become exponentially more susceptible to permanent wearout faults. Without mitigation, these faults will render systems useless within unacceptably short time periods. We present the design of a runtime task mapping subsystem that mitigates these faults using a wear-based heuristic, and we compare this heuristic to power- and temperature-based heuristics within the same system framework. Using a wide range of synthetic and real-world benchmarks, we show that the wear-based heuristic improves total system lifetime by an average of 7.1% over temperature-based heuristics. We also show that it substantially improves the time to first failure (TTFF), the time until the first component of a system fails. TTFF is of interest to designers who wish to avoid the design and verification difficulties of systems that are expected to recover after a component failure. The wear-based heuristic improves TTFF by an average of 14.6% over temperature-based heuristics across all of our benchmarks. We conclude that runtime, wear-based task mapping must be incorporated into systems for which lifetime is a primary design goal.
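The abstract does not specify how the wear-based heuristic works, so the following is only an illustrative sketch of the general idea: assign the most demanding tasks to the least-worn cores. The function names, the scalar wear estimate per core, the scalar load per task, and the one-task-per-core restriction are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List


def map_tasks_by_wear(task_loads: Dict[str, float],
                      core_wear: Dict[str, float]) -> Dict[str, str]:
    """Hypothetical greedy mapping: heaviest task goes to the least-worn core.

    task_loads: assumed per-task demand (higher means more stressful).
    core_wear:  assumed per-core accumulated wear estimate (higher means more worn).
    At most one task is placed on each core in this simplified sketch.
    """
    if len(task_loads) > len(core_wear):
        raise ValueError("this sketch needs at least as many cores as tasks")
    # Heaviest tasks first; ties are broken by name for determinism.
    tasks = sorted(task_loads, key=lambda t: (-task_loads[t], t))
    # Least-worn cores first; ties are broken by name for determinism.
    cores = sorted(core_wear, key=lambda c: (core_wear[c], c))
    return dict(zip(tasks, cores))


def time_to_first_failure(failure_times: List[float]) -> float:
    """TTFF is the earliest failure time among all components."""
    return min(failure_times)


mapping = map_tasks_by_wear(
    {"decode": 0.9, "log": 0.2},
    {"core0": 0.5, "core1": 0.1, "core2": 0.3},
)
ttff = time_to_first_failure([7.5, 4.2, 9.1])
```

In this example, the heavier `decode` task lands on `core1`, the least-worn core, and `log` lands on `core2`. A power- or temperature-based variant would sort the cores by a power or temperature estimate instead of by wear.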
