Tigres Workflow Library: Supporting Scientific Pipelines on HPC Systems

The growth in scientific data volumes has created a need for new tools that enable users to operate on and analyze data on large-scale resources. A number of scientific workflow tools have emerged over the last decade. These tools often target distributed environments and often require expert help to compose and execute workflows. Data-intensive workflows are often ad hoc: they involve an iterative development process in which users compose and test their workflows on desktops and then scale up to larger systems. In this paper, we present the design and implementation of Tigres, a workflow library that supports the iterative development cycle of data-intensive workflows. Tigres provides an application programming interface to a set of programming templates (sequence, parallel, split, and merge) that can be used to compose and execute computational and data pipelines. We discuss the results of our evaluation of scientific and synthetic workflows, which show that Tigres performs with minimal template overheads (mean of 13 seconds over all experiments). We also discuss various factors (e.g., I/O performance, execution mechanisms) that affect the performance of scientific workflows on HPC systems.
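To make the four templates concrete, the following is a minimal sketch of what sequence, parallel, split, and merge can mean when written in plain Python on top of the standard library. It is not the Tigres API: the function names, signatures, and the use of a thread pool are assumptions made purely for illustration, and the split and merge semantics shown (one task fanning out to many, many tasks fanning in to one) are one common reading of those terms.

```python
from concurrent.futures import ThreadPoolExecutor


def sequence(tasks, value):
    """Run tasks one after another, feeding each output to the next task."""
    for task in tasks:
        value = task(value)
    return value


def parallel(tasks, inputs):
    """Run task i on input i concurrently; results keep the input order."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(task, item) for task, item in zip(tasks, inputs)]
        return [future.result() for future in futures]


def split(split_task, value, worker):
    """Fan out: one task produces pieces, then the same worker runs on each piece."""
    pieces = split_task(value)
    return parallel([worker] * len(pieces), pieces)


def merge(workers, inputs, merge_task):
    """Fan in: workers run concurrently, then one task combines their results."""
    return merge_task(parallel(workers, inputs))


if __name__ == "__main__":
    # Hypothetical toy pipeline: clean a string, then count word lengths.
    cleaned = sequence([str.strip, str.lower], "  Tigres Workflow Library ")
    lengths = split(str.split, cleaned, len)
    total = merge([len, len, len], cleaned.split(), sum)
    print(cleaned, lengths, total)
```

Because each template is an ordinary function, the same script can be tested on a desktop and later pointed at a different execution mechanism by swapping the pool used inside `parallel`, which mirrors the iterative develop-then-scale cycle the abstract describes.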
