Reliability-aware scalability models for high performance computing
[1] Zhiling Lan,et al. Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.
[2] Ron A. Oldfield. Lightweight storage and overlay networks for fault tolerance , 2006 .
[3] Vipin Kumar,et al. Analysis of scalability of parallel algorithms and architectures: a survey , 1991, ICS '91.
[4] Zhiling Lan,et al. Adaptive Fault Management of Parallel Applications for High-Performance Computing , 2008, IEEE Transactions on Computers.
[5] Ming Wu,et al. Scalability of heterogeneous computing , 2005, 2005 International Conference on Parallel Processing (ICPP'05).
[6] Bianca Schroeder,et al. A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.
[7] Mark S. Squillante,et al. Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.
[8] James S. Plank,et al. Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems , 2001, J. Parallel Distributed Comput..
[9] Hsien-Hsin S. Lee,et al. Extending Amdahl's Law for Energy-Efficient Computing in the Many-Core Era , 2008, Computer.
[10] Ravishankar K. Iyer,et al. Modeling coordinated checkpointing for large-scale supercomputers , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).
[11] Jon Stearley,et al. What Supercomputers Say: A Study of Five System Logs , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).
[12] Kishor S. Trivedi,et al. Minimizing completion time of a program by checkpointing and rejuvenation , 1996, SIGMETRICS '96.
[13] John W. Young. A first order approximation to the optimum checkpoint interval , 1974 .
[14] Gregory A. Koenig,et al. Cluster Survivability with ByzwATCh: A Byzantine Hardware Fault Detector for Parallel Machines with Charm++ , 2006 .
[15] Ming Wu,et al. Algorithm-system scalability of heterogeneous computing , 2008, J. Parallel Distributed Comput..
[16] E. N. Elnozahy,et al. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery , 2004, IEEE Transactions on Dependable and Secure Computing.
[17] Larry Rudolph,et al. Cooperative checkpointing: a robust approach to large-scale systems reliability , 2006, ICS '06.
[18] John L. Gustafson,et al. Reevaluating Amdahl's law , 1988, CACM.
[19] Charng-Da Lu,et al. Big Systems and Big Reliability Challenges , 2003, PARCO.
[20] John T. Daly,et al. A higher order estimate of the optimum checkpoint interval for restart dumps , 2006, Future Gener. Comput. Syst..
[21] Sarah Ellen Michalak,et al. Application MTTFE vs. Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale , 2008, 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID).
[22] Rong Ge,et al. Power-Aware Speedup , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[23] John W. Young,et al. A first order approximation to the optimum checkpoint interval , 1974, CACM.
[24] Thomas Hérault,et al. Improved message logging versus improved coordinated checkpointing for fault tolerant MPI , 2004, 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935).
[25] Meeta Sharma Gupta,et al. Performance implications of periodic checkpointing on large-scale cluster systems , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[26] James S. Plank,et al. Experimental assessment of workstation failures and their impact on checkpointing systems , 1998, Digest of Papers. Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing (Cat. No.98CB36224).
[27] Fabrizio Petrini,et al. System-level fault-tolerance in large-scale parallel machines with buffered coscheduling , 2004, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[28] Chao Wang,et al. Proactive process-level live migration in HPC environments , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.
[29] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[30] Sarala Arunagiri,et al. Opportunistic Checkpoint Intervals to Improve System Performance , 2008 .
[31] Lionel M. Ni,et al. Another view on parallel speedup , 1990, Proceedings SUPERCOMPUTING '90.
[32] Daniel P. Siewiorek,et al. Error log analysis: statistical modeling and heuristic trend analysis , 1990 .
[33] Ming Wu,et al. Performance under failures of high-end computing , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).