The Interplay Between Energy Efficiency and Resilience for Scalable High Performance Computing Systems

With exascale supercomputers expected to arrive around 2020, today's supercomputers are expanding rapidly in size and in duration of use, which places demanding requirements on energy efficiency and resilience. These requirements are becoming prevalent and challenging for two reasons: (a) the cost of powering a supercomputer grows substantially with its scale, and (b) failure rates of large-scale High Performance Computing (HPC) systems increase dramatically because a large number of compute nodes are interconnected as a whole. It is thus desirable to consider both dimensions when building scalable, cost-efficient, and robust HPC systems in this era. Specifically, our goal is to achieve the optimal performance-power-failure ratio while exploiting parallelism during HPC runs.

Among the wide range of HPC applications, numerical linear algebra matrix operations, including matrix multiplication and Cholesky, LU, and QR factorizations, are fundamental and have been used extensively in science and engineering. For some scientific applications, these matrix operations are the core component and dominate total execution time, so saving energy in them contributes significantly to the energy efficiency of scientific computing. Typically, when processors experience idle time (slack) during HPC runs, energy can be saved by scaling down processor frequency and voltage during the underused execution phases. Existing OS-level energy-efficient solutions are highly general and can save energy for some applications in a black-box fashion, but they fall short for applications with variable workloads such as these matrix operations: the workload prediction they rely on can be inaccurate and costly, so the optimal energy savings cannot be achieved.
Therefore, we propose to exploit the algorithmic characteristics of the matrix operations to maximize potential energy savings. Specifically, we maximize energy savings in two ways: (a) reducing the overhead of processor frequency switches during slack, and (b) accurately predicting processor slack via algorithm-based slack prediction and eliminating that slack accordingly while respecting the critical path of an HPC run.

While energy efficiency and resilience have each been studied extensively, little has been done to understand the interplay between them in HPC systems. We propose to quantitatively analyze the trade-offs between energy efficiency and resilience in the large-scale HPC environment. First, we observe that existing energy-saving solutions based on slack reclamation are essentially frequency-directed and thus fail to exploit further energy-saving opportunities. In our approach, we decrease the supply voltage associated with a given processor operating frequency to further reduce power consumption, at the cost of increased failure rates, and we leverage mainstream resilience techniques to tolerate the additional failures caused by undervolting. Our strategy is theoretically validated and empirically evaluated to save more energy than a state-of-the-art frequency-directed energy-saving solution, with correctness guaranteed. Second, to capture the impacts of frequency-directed solutions and undervolting, we develop analytic models that investigate the trade-offs among resilience, energy efficiency, and scalability for large-scale HPC systems. We discuss various HPC parameters that inherently affect one another, and we determine the optimal energy savings at scale, in terms of the number of floating-point operations per Watt, in the presence of undervolting and fault tolerance.
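
To make the two energy-saving ideas above concrete, the following is a minimal sketch and not the models developed in this work. It assumes compute-bound tasks whose run time scales inversely with frequency, the textbook dynamic-power approximation P = C f V^2, and no idle or leakage power. The gear table `GEARS`, the helper names, the undervolted voltage, and the 5% recovery overhead are all hypothetical values chosen for illustration.

```python
# Hypothetical frequency gears: GHz -> nominal supply voltage (volts).
GEARS = {2.4: 1.30, 1.8: 1.15, 1.2: 1.00}


def dynamic_power(freq_ghz, volts, capacitance=1.0):
    """Textbook CMOS dynamic-power approximation: P = C * f * V^2."""
    return capacitance * freq_ghz * volts ** 2


def run_time(busy_s, f_max, freq):
    """Assumes compute-bound work: time scales as 1/frequency."""
    return busy_s * f_max / freq


def reclaim_slack(busy_s, slack_s, gears=GEARS):
    """Frequency-directed slack reclamation: pick the lowest gear that
    still finishes within busy time plus slack, so the critical path
    of the run is not lengthened."""
    f_max = max(gears)
    deadline = busy_s + slack_s
    feasible = [f for f in gears
                if run_time(busy_s, f_max, f) <= deadline + 1e-9]
    return min(feasible)


def task_energy(busy_s, freq, volts, gears=GEARS, recovery_fraction=0.0):
    """Energy of one task. recovery_fraction is the assumed extra run
    time spent tolerating failures (e.g. induced by undervolting)."""
    f_max = max(gears)
    t = run_time(busy_s, f_max, freq) * (1.0 + recovery_fraction)
    return dynamic_power(freq, volts) * t


busy, slack = 10.0, 5.0                      # seconds at the top gear
gear = reclaim_slack(busy, slack)            # lowest feasible frequency
e_base = task_energy(busy, 2.4, GEARS[2.4])  # run at full speed, then idle
e_dvfs = task_energy(busy, gear, GEARS[gear])
# Undervolting: same frequency, lower (hypothetical) voltage, with an
# assumed 5% time overhead for fault tolerance.
e_uv = task_energy(busy, gear, 1.05, recovery_fraction=0.05)
print(gear, round(e_base, 2), round(e_dvfs, 2), round(e_uv, 2))
```

Under these assumptions, undervolting saves energy beyond frequency scaling only while the voltage reduction outweighs the time spent recovering from the extra failures, which is the trade-off the analytic models are meant to quantify.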
