A schema aware ETL workflow generator

Extract, Transform and Load (ETL) processes organized as workflows play an important role in data warehousing. Because ETL workflows are usually complex, various ETL facilities have been developed to address their control-flow process modeling and execution control. To evaluate the quality of these facilities, synthetic ETL workflow test cases covering both control-flow and data-flow aspects are needed, both to check facility functionality at construction time and to validate correctness and performance at run time. Although approaches for generating synthetic workflows and synthetic data sets exist in the literature, little work considers both aspects at the same time, specifically for ETL workflow generation. To address this issue, this paper proposes a schema aware ETL workflow generator with which users can characterize their ETL workflows through various parameters and obtain ETL workflow test cases comprising the control flow of ETL activities, compliant schemas, and associated recordsets. The generator works in three steps. First, given the types and ratios of individual activities and the parameters characterizing their connections, the generator produces ETL activities and forms an ETL skeleton, which determines how the generated activities cooperate with each other. Second, given the parameters characterizing schema transformation, e.g. ranges for the number of attributes, the generator resolves attribute dependencies and refines the input/output schemas with compliant attributes and their data types. In the last step, recordsets are generated following cardinality specifications. In the experiments, ETL workflows in specific patterns are produced to demonstrate the capabilities of the generator, and thousands of ETL workflow test cases are generated in seconds to verify its usability.
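
The three steps described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the activity kinds (filter, project, derive), the linear-chain skeleton, the type names, and every function and parameter name below are assumptions chosen only to make the idea concrete.

```python
import random
from dataclasses import dataclass, field

# Hypothetical value generators, one per (assumed) attribute data type.
TYPES = {
    "int": lambda r: r.randint(0, 999),
    "float": lambda r: round(r.uniform(0, 1), 3),
    "str": lambda r: "".join(r.choices("abcdef", k=5)),
}

@dataclass
class Activity:
    name: str
    kind: str                                   # source | filter | project | derive | sink
    inputs: list = field(default_factory=list)  # names of upstream activities
    schema: dict = field(default_factory=dict)  # output schema: attribute -> type name

def build_skeleton(n, ratios, rng):
    """Step 1: draw activity kinds by ratio and connect them.
    Simplification: the skeleton is a linear chain from source to sink."""
    kinds = rng.choices(list(ratios), weights=list(ratios.values()), k=n - 2)
    acts = []
    for i, kind in enumerate(["source", *kinds, "sink"]):
        inputs = [acts[-1].name] if acts else []
        acts.append(Activity(f"a{i}", kind, inputs))
    return acts

def refine_schemas(acts, attr_range, rng):
    """Step 2: propagate schemas downstream so each output schema
    depends only on attributes that exist upstream."""
    by_name = {a.name: a for a in acts}
    for act in acts:  # list order is already a topological order
        if act.kind == "source":
            k = rng.randint(*attr_range)
            act.schema = {f"c{j}": rng.choice(list(TYPES)) for j in range(k)}
            continue
        schema = dict(by_name[act.inputs[0]].schema)
        if act.kind == "project" and len(schema) > 1:
            del schema[rng.choice(list(schema))]               # drop one attribute
        elif act.kind == "derive":
            schema[f"d_{act.name}"] = rng.choice(list(TYPES))  # add one attribute
        act.schema = schema                                    # filter/sink keep the schema

def generate_records(schema, cardinality, rng):
    """Step 3: produce `cardinality` records that conform to the schema."""
    return [{attr: TYPES[t](rng) for attr, t in schema.items()}
            for _ in range(cardinality)]

def generate_workflow(n=6, ratios=None, attr_range=(3, 6), cardinality=5, seed=0):
    ratios = ratios or {"filter": 0.5, "project": 0.3, "derive": 0.2}
    rng = random.Random(seed)  # seeded so a test case can be regenerated
    acts = build_skeleton(n, ratios, rng)
    refine_schemas(acts, attr_range, rng)
    records = generate_records(acts[0].schema, cardinality, rng)
    return acts, records
```

Calling `generate_workflow(n=6, cardinality=5, seed=1)` returns the list of activities with their refined schemas and a source recordset of five rows. A fuller generator would presumably support branching skeletons (for example joins and splits) and per-activity cardinalities, which this sketch leaves out.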
