Chiron: a parallel engine for algebraic scientific workflows

Large-scale scientific experiments based on computer simulations are typically modeled as scientific workflows, which eases the chaining of different programs. These scientific workflows are defined, executed, and monitored by scientific workflow management systems (SWfMS). Because these experiments manage large amounts of data, it becomes critical to execute them in high-performance computing environments, such as clusters, grids, and clouds. However, few SWfMS provide parallel support. Those that do are usually labor-intensive for workflow developers and offer limited primitives for optimizing workflow execution. To address these issues, we developed a workflow algebra to specify and enable the optimization of parallel execution of scientific workflows. In this paper, we show how the workflow algebra is efficiently implemented in Chiron, an algebra-based parallel scientific workflow engine. Chiron has a unique native distributed provenance mechanism that enables runtime queries in a relational database. We conducted two studies to evaluate the performance of our algebraic approach implemented in Chiron; the first compares Chiron with different approaches, whereas the second evaluates the scalability of Chiron. By analyzing the results, we conclude that Chiron is efficient in executing scientific workflows, with the benefits of declarative specification and runtime provenance support. Copyright © 2013 John Wiley & Sons, Ltd.
