Parallelizing user-defined functions in the ETL workflow using orchestration style sheets

Abstract

Today's ETL tools let developers write custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. While this makes it easy to add new functionality, it carries the risk that the custom code is not written with optimization in mind (e.g., parallelism) and therefore performs poorly in data-intensive ETL workflows. In this paper, we present a novel framework that lets the ETL developer choose a design pattern for writing parallelizable code and that generates a configuration for executing the UDFs in a distributed environment. This enables ETL developers with minimal expertise in distributed and parallel computing to develop UDFs without dealing with the configuration and complexity of parallelization. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort required of ETL developers while generating efficient parallel configurations that support complex, data-intensive ETL tasks.
