Adaptive and Power-aware Fault Tolerance for Future Extreme-scale Computing

Two major trends in large-scale computing are the rapid growth of HPC, in particular an international exascale initiative, and the dramatic expansion of Cloud infrastructures accompanied by the enthusiasm for Big Data. To satisfy the continuous demand for increasing computing capacity, future extreme-scale systems will embrace a multi-fold increase in the number of computing, storage, and communication components in order to support an unprecedented level of parallelism. Despite the capacity and economic benefits, the transition to extreme scale poses numerous scientific and technological challenges, two of which are power consumption and fault tolerance. As system scale increases, failures become the norm rather than the exception, driving the system to significantly lower efficiency with unforeseen power consumption. This thesis aims to address both challenges simultaneously by introducing a novel fault-tolerant computational model, referred to as Leaping Shadows. Based on Shadow Replication, Leaping Shadows associates with each main process a suite of coordinated shadow processes, which execute in parallel but at differential rates, to deal with failures and meet the QoS requirements of the underlying application under strict power/energy constraints. In failure-prone extreme-scale computing environments, this new model addresses the limitations of the basic Shadow Replication model and achieves adaptive and power-aware fault tolerance that is more time- and energy-efficient than existing techniques. In this thesis, we first present an optimization framework, based on an analytical model, that demonstrates Shadow Replication's adaptivity and flexibility in achieving multi-dimensional QoS requirements. Then, we introduce Leaping Shadows as a novel power-aware fault tolerance model, which tolerates multiple types of failures, guarantees forward progress, and maintains a consistent level of resilience.
Lastly, the details of a Leaping Shadows implementation in MPI are discussed, along with an extensive performance evaluation that includes a comparison to checkpoint/restart. Collectively, these efforts advocate an adaptive and power-aware fault tolerance alternative for future extreme-scale computing.
