Failure analysis of distributed scientific workflows executing in the cloud

This work presents models characterizing failures observed during the execution of large scientific applications on Amazon EC2. Scientific workflows are used as the underlying abstraction for application representations. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. We study job failure models for data collected from 4 scientific applications, by our Stampede framework. In particular, we show that a Naive Bayes classifier can accurately predict the failure probability of jobs. The models allow us to predict job failures for a given execution resource and then use these failure predictions for two higher-level goals: (1) to suggest a better job assignment, and (2) to provide quantitative feedback to the workflow component developer about the robustness of their application codes.

[1]  David A. Cieslak,et al.  Troubleshooting thousands of jobs on production grids using data mining techniques , 2008, 2008 9th IEEE/ACM International Conference on Grid Computing.

[2]  Ewa Deelman,et al.  Scientific Workflows in the Cloud , 2011 .

[3]  Brian Tierney,et al.  Log summarization and anomaly detection for troubleshooting distributed systems , 2007, 2007 8th IEEE/ACM International Conference on Grid Computing.

[4]  Ewa Deelman,et al.  Online Fault and Anomaly Detection for Large-Scale Scientific Workflows , 2011, 2011 IEEE International Conference on High Performance Computing and Communications.

[5]  Anand Sivasubramaniam,et al.  Failure Prediction in IBM BlueGene/L Event Logs , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[6]  Ewa Deelman,et al.  Failure prediction and localization in large scientific workflows , 2011, WORKS '11.

[7]  Alexandru Iosup,et al.  Performance Analysis of Cloud Computing Services for Many-Tasks Scientific Computing , 2011, IEEE Transactions on Parallel and Distributed Systems.

[8]  Michael Wilde,et al.  Kickstarting remote applications , 2006 .

[9]  Daniel S. Katz,et al.  Montage: a grid-enabled engine for delivering custom science-grade mosaics on demand , 2004, SPIE Astronomical Telescopes + Instrumentation.

[10]  Ran Wolff,et al.  Mining for misconfigured machines in grid systems , 2006, KDD '06.

[11]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.

[12]  G. Bruce Berriman,et al.  Data Sharing Options for Scientific Workflows on Amazon EC2 , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[13]  Lavanya Ramakrishnan,et al.  Magellan: experiences from a science cloud , 2011, ScienceCloud '11.

[14]  Zhiling Lan,et al.  Toward Automated Anomaly Identification in Large-Scale Systems , 2010, IEEE Transactions on Parallel and Distributed Systems.

[15]  Daniel A. Reed,et al.  Analysis of application heartbeats: Learning structural and temporal features in time series data for identification of performance problems , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[16]  Chuang Liu,et al.  Anomaly detection and diagnosis in grid environments , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[17]  D. Martin Swany,et al.  Online workflow management and performance analysis with Stampede , 2011, 2011 7th International Conference on Network and Service Management.

[18]  Wenguang Chen,et al.  Cloud versus in-house cluster: Evaluating Amazon cluster compute instances for running MPI applications , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[19]  Narayan Desai,et al.  Co-analysis of RAS Log and Job Log on Blue Gene/P , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.

[20]  Ewa Deelman,et al.  Experiences using cloud computing for a scientific workflow application , 2011, ScienceCloud '11.

[21]  Brian Tierney,et al.  NetLogger: A Toolkit for Distributed System Performance Tuning and Debugging , 2003, Integrated Network Management.

[22]  Cheng-Zhong Xu,et al.  Exploring event correlation for failure prediction in coalitions of clusters , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[23]  Kashi Venkatesh Vishwanath,et al.  Characterizing cloud computing hardware reliability , 2010, SoCC '10.

[24]  Daniel S. Katz,et al.  Pegasus: A framework for mapping complex scientific workflows onto distributed systems , 2005, Sci. Program..