The RECIPE approach to challenges in deeply heterogeneous high performance systems

Abstract: RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims to introduce a hierarchical runtime resource management infrastructure that optimizes energy efficiency and minimizes the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches that maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability obtained, respectively, through timing analysis and through modeling of the thermal properties and mean time to failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case. This manuscript has been submitted to the Microprocessors and Microsystems Special Issue on European Projects in Embedded Systems Design.
