Reliability-aware scalability models for high performance computing

Scalability models are powerful analytical tools for evaluating and predicting the performance of parallel applications. However, existing scalability models do not quantify the impact of failures and therefore cannot accurately describe application performance when failures occur. In this study, we extend two well-known models, Amdahl's law and Gustafson's law, to account for the impact of failures and the effect of fault tolerance techniques on applications. The derived reliability-aware models can be used to predict application scalability in environments where failures occur and to evaluate fault tolerance techniques. Trace-based simulations driven by real failure logs demonstrate that the new models provide a better understanding of application performance and scalability in the presence of failures.
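For orientation, the two classical laws named above can be sketched in their standard failure-free form. This is only the textbook baseline, not the reliability-aware models derived in the study; the function names, parameter names, and sample values are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Classical Amdahl's law (fixed problem size): 1 / ((1 - p) + p / n)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def gustafson_speedup(parallel_fraction: float, processors: int) -> float:
    """Classical Gustafson's law (scaled problem size): (1 - p) + p * n."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction + parallel_fraction * processors


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work is parallelizable.
    for n in (1, 16, 256, 4096):
        print(n, amdahl_speedup(0.95, n), gustafson_speedup(0.95, n))
```

Neither function has a term for failures, checkpointing, or recovery, which is the gap the abstract describes: under these baselines, adding processors never reduces the predicted speedup.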
