A Resilience Approach to High-Performance Workflows

This report presents an approach to designing, implementing, and deploying resilient distributed workflows. The approach supports the smooth integration of existing simulation software, e.g., Matlab, Scilab, Python, OpenFOAM, and ParaView, as well as other application programs. The report's contribution is a new feature that supports resilience, i.e., application-level fault tolerance and exception handling. Connections with exascale computing requirements are also discussed. An overview of a prototype implementation based on the YAWL workflow management system is given.
