Online Fault and Anomaly Detection for Large-Scale Scientific Workflows

Scientific workflows enable complex scientific analyses. Large-scale workflows run on complex parallel and distributed resources, where failures can occur at many points. Application scientists need to track the status of their workflows in real time, detect execution anomalies automatically, and troubleshoot problems without logging into remote nodes or searching through thousands of log files. As part of the NSF-funded Synthesized Tools for Archiving, Monitoring Performance and Enhanced DEbugging (STAMPEDE) project, we have developed an infrastructure that addresses these needs by integrating detailed workflow and resource monitoring. On top of this infrastructure, we have developed analysis techniques for online detection of a wide variety of "hard" and "soft" failures. We use the detected failures to derive higher-level statistics about the status of the resources and of the workflow as a whole. In this paper, we describe our techniques and evaluate their effectiveness using real application logs.
