WARM: Workload-Aware Reliability Management in Linux/Android

With CMOS scaling beyond 14 nm, reliability is a major concern for IC manufacturers. Reliability-aware design has a non-negligible overhead and cannot account for user experience in mobile devices. An alternative is dynamic reliability management (DRM), which counteracts degradation by adapting the operating conditions at runtime. In this paper, for the first time, we formulate DRM as an optimization problem that accounts for reliability, temperature, and performance. We develop an optimal policy for multicores using convex optimization and show that it is not feasible to implement on real systems. For this reason, we propose workload-aware reliability management (WARM), a fast DRM technique that adapts to diverse workload requirements to trade off reliability against user experience. WARM is implemented and tested on a real Android device. WARM approximates the solution of the convex solver within 5% on average, while executing more than $400\times$ faster. Because degradation strongly depends on temperature, WARM integrates a thermal controller that allocates tasks to meet thermal constraints. We show that WARM meets temperature constraints within 5% in 87.5% more cases than the state-of-the-art. We show that WARM task allocation achieves up to one year of lifetime improvement for a multicore platform. It can achieve up to 100% performance improvement on cluster architectures, such as big.LITTLE, while still guaranteeing the reliability target. Finally, we show that it achieves performance within 4% of the maximum for a broad range of applications, while meeting the reliability constraints.
