Job failures in high performance computing systems: A large-scale empirical study

[1]  F. Haight Handbook of the Poisson Distribution , 1967 .

[2]  James H. Greene,et al.  Production and Inventory Control Handbook , 1970 .

[3]  F. James Statistical Methods in Experimental Physics , 1973 .

[4]  Sheldon M. Ross,et al.  Introduction to Probability Models (4th ed.) , 1990 .

[5]  Ravishankar K. Iyer,et al.  Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[6]  Nitin H. Vaidya,et al.  A case for two-level distributed recovery schemes , 1995, SIGMETRICS '95/PERFORMANCE '95.

[7]  Mark A. Franklin,et al.  Checkpointing in Distributed Computing Systems , 1996, J. Parallel Distributed Comput.

[8]  William H. Sanders,et al.  Performance analysis of two time-based coordinated checkpointing protocols , 1997, Proceedings Pacific Rim International Symposium on Fault-Tolerant Systems.

[9]  Sheldon M. Ross Introduction to probability models , 1998 .

[10]  Ravishankar K. Iyer,et al.  Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[11]  Dror G. Feitelson,et al.  Utilization and Predictability in Scheduling the IBM SP2 with Backfilling , 1998, Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing.

[12]  Dmitry N. Zotkin,et al.  Job-length estimation and performance in backfilling schedulers , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[13]  W. Cirne,et al.  A comprehensive model of the supercomputer workload , 2001, Proceedings of the Fourth Annual IEEE International Workshop on Workload Characterization. WWC-4 (Cat. No.01EX538).

[14]  Anand Sivasubramaniam,et al.  An Integrated Approach to Parallel Scheduling Using Gang-Scheduling, Backfilling, and Migration , 2001, JSSPP.

[15]  Richard P. Martin,et al.  Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.

[16]  Amin Vahdat,et al.  Workload and Failure Characterization on a Large-Scale Federated Testbed , 2003 .

[17]  Archana Ganapathi,et al.  Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.

[18]  Mark S. Squillante,et al.  Performance Implications of Failures in Large-Scale Cluster Scheduling , 2004, JSSPP.

[19]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[20]  Dan Tsafrir,et al.  Modeling User Runtime Estimates , 2005, JSSPP.

[21]  Richard Wolski,et al.  Modeling Machine Availability in Enterprise and Wide-Area Distributed Computing Environments , 2005, Euro-Par.

[22]  Hui Li,et al.  Job Failure Analysis and Its Implications in a Large-Scale Production Grid , 2006, 2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06).

[23]  Dror G. Feitelson,et al.  Locality of sampling and diversity in parallel system workloads , 2007, ICS '07.

[24]  J. Sikora Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? , 2007 .

[25]  Ming Wu,et al.  Performance under failures of high-end computing , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[26]  Alexandru Iosup,et al.  The Characteristics and Performance of Groups of Jobs in Grids , 2007, Euro-Par.

[27]  Christian Engelmann,et al.  Proactive fault tolerance for HPC with Xen virtualization , 2007, ICS '07.

[28]  Cheng-Zhong Xu,et al.  Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[29]  Jean-Marc Vincent,et al.  Mining for statistical models of availability in large-scale distributed systems: An empirical study of SETI@home , 2009, 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems.

[30]  Jean-Marc Vincent,et al.  Mining for Availability Models in Large-Scale Distributed Systems: A Case Study of SETI@home , 2009 .

[31]  David P. Bunde,et al.  Scheduling Restartable Jobs with Short Test Runs , 2009, JSSPP.

[32]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2010, IEEE Trans. Dependable Secur. Comput.

[33]  Alexandru Iosup,et al.  The Failure Trace Archive: Enabling Comparative Analysis of Failures in Diverse Distributed Systems , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.