Job failures in high performance computing systems: A large-scale empirical study

Guangwen Yang | Weimin Zheng | Qiuping Wang | Yongwei Wu | Yulai Yuan

[1] F. Haight. Handbook of the Poisson Distribution, 1967.

[2] James H. Greene, et al. Production and Inventory Control Handbook, 1970.

[3] F. James. Statistical Methods in Experimental Physics, 1973.

[4] Sheldon M. Ross, et al. Introduction to Probability Models (4th ed.), 1990.

[5] Ravishankar K. Iyer, et al. Failure analysis and modeling of a VAXcluster system, 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[6] Nitin H. Vaidya, et al. A case for two-level distributed recovery schemes, 1995, SIGMETRICS '95/PERFORMANCE '95.

[7] Mark A. Franklin, et al. Checkpointing in Distributed Computing Systems, 1996, J. Parallel Distributed Comput.

[8] William H. Sanders, et al. Performance analysis of two time-based coordinated checkpointing protocols, 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.

[9] Sheldon M. Ross. Introduction to probability models, 1998.

[10] Ravishankar K. Iyer, et al. Networked Windows NT system field failure data analysis, 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[11] Dror G. Feitelson, et al. Utilization and Predictability in Scheduling the IBM SP2 with Backfilling, 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[12] Dmitry N. Zotkin, et al. Job-length estimation and performance in backfilling schedulers, 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[13] W. Cirne, et al. A comprehensive model of the supercomputer workload, 2001, Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538).

[14] Anand Sivasubramaniam, et al. An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling, and Migration, 2001, JSSPP.

[15] Richard P. Martin, et al. Improving cluster availability using workstation validation, 2002, SIGMETRICS '02.

[16] Amin Vahdat, et al. Workload and Failure Characterization on a Large-Scale Federated Testbed, 2003.

[17] Archana Ganapathi, et al. Why Do Internet Services Fail, and What Can Be Done About It?, 2002, USENIX Symposium on Internet Technologies and Systems.

[18] Mark S. Squillante, et al. Performance Implications of Failures in Large-Scale Cluster Scheduling, 2004, JSSPP.

[19] Mark S. Squillante, et al. Failure data analysis of a large-scale heterogeneous server environment, 2004, International Conference on Dependable Systems and Networks, 2004.

[20] Dan Tsafrir, et al. Modeling User Runtime Estimates, 2005, JSSPP.

[21] Richard Wolski, et al. Modeling Machine Availability in Enterprise and Wide-Area Distributed Computing Environments, 2005, Euro-Par.

[22] Hui Li, et al. Job Failure Analysis and Its Implications in a Large-Scale Production Grid, 2006, 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06).

[23] Dror G. Feitelson, et al. Locality of sampling and diversity in parallel system workloads, 2007, ICS '07.

[24] J. Sikora. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, 2007.

[25] Ming Wu, et al. Performance under failures of high-end computing, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[26] Alexandru Iosup, et al. The Characteristics and Performance of Groups of Jobs in Grids, 2007, Euro-Par.

[27] Christian Engelmann, et al. Proactive fault tolerance for HPC with Xen virtualization, 2007, ICS '07.

[28] Cheng-Zhong Xu, et al. Exploring event correlation for failure prediction in coalitions of clusters, 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[29] Jean-Marc Vincent, et al. Mining for statistical models of availability in large-scale distributed systems: An empirical study of SETI@home, 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[30] Jean-Marc Vincent, et al. Mining for Availability Models in Large-Scale Distributed Systems: A Case Study of SETI@home, 2009.

[31] David P. Bunde, et al. Scheduling Restartable Jobs with Short Test Runs, 2009, JSSPP.

[32] Bianca Schroeder, et al. A Large-Scale Study of Failures in High-Performance Computing Systems, 2010, IEEE Trans. Dependable Secur. Comput.

[33] Alexandru Iosup, et al. The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems, 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.