Efficient Integration of Containers into Scientific Workflows

Containers offer a powerful way to make scientific applications portable. However, incorporating them into workflows requires careful consideration, as straightforward approaches can increase network usage and runtime. We identified three issues in this process: container composition, containerizing workers or jobs, and container image translation. To tackle composition, we classify data into three types (OS data, Read-Only data, and Working data) and define static and dynamic composition. Static composition (creating a single container for each job) leads to massive waste, because duplicate data is sent over the network. Dynamic composition (sending the data types separately) enables caching on worker nodes. To decide whether to run workers or jobs inside a container, we examined the costs of running inside a container. Finally, when using different container technologies simultaneously, we found it is better to convert to the target image type before sending the container images than to repeat the same conversion at each job node, which wastes time.
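The difference between static and dynamic composition can be illustrated with a minimal sketch. It is not the implementation evaluated here; the `Job` class, the function names, the item sizes, and the job-to-worker assignment are all hypothetical and chosen only to show how per-worker caching reduces the data sent over the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    os_data: str     # name of the OS data item (hypothetical)
    read_only: str   # name of the shared Read-Only data item
    working: str     # name of the per-job Working data item


def static_transfer(jobs, sizes):
    """Static composition: each job ships one container holding all three data types."""
    return sum(sizes[j.os_data] + sizes[j.read_only] + sizes[j.working] for j in jobs)


def dynamic_transfer(jobs, sizes, assignment):
    """Dynamic composition: items are sent separately and cached on each worker."""
    cache = {}  # worker id -> set of item names already present on that worker
    total = 0
    for job, worker in zip(jobs, assignment):
        cached = cache.setdefault(worker, set())
        for item in (job.os_data, job.read_only, job.working):
            if item not in cached:
                total += sizes[item]  # only uncached items cross the network
                cached.add(item)
    return total


# Assumed sizes in MB, for illustration only.
sizes = {"os": 500, "ref": 2000, "in0": 10, "in1": 10, "in2": 10, "in3": 10}
jobs = [Job("os", "ref", f"in{i}") for i in range(4)]
assignment = [0, 1, 0, 1]  # four jobs spread over two workers

static_mb = static_transfer(jobs, sizes)
dynamic_mb = dynamic_transfer(jobs, sizes, assignment)
print(static_mb, dynamic_mb)
```

With these assumed sizes, the OS and Read-Only data are sent once per worker under dynamic composition rather than once per job, so the gap widens as more jobs share a worker.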
