Building Pipelines for Heterogeneous Execution Environments for Big Data Processing

Many real-world data analysis scenarios require pipelining and integration of multiple (big) data-processing and data-analytics jobs, which often execute in heterogeneous environments, such as MapReduce; Spark; or R, Python, or Bash scripts. Such a pipeline requires much glue code to get data across environments. Maintaining and evolving these pipelines are difficult. Pipeline frameworks that try to solve such problems are usually built on top of a single environment. They might require rewriting the original job to take into account a new API or paradigm. The Pipeline61 framework supports the building of data pipelines involving heterogeneous execution environments. It reuses the existing code of the deployed jobs in different environments and provides version control and dependency management that deals with typical software engineering issues. A real-world case study shows its effectiveness. This article is part of a special issue on Software Engineering for Big Data Systems.