A semantic‐aware data generator for ETL workflows

Extract, transform, and load (ETL) processes organized as workflows play an important role in future data integration for cloud services. ETL designers and administrators need test data sets that reflect the semantics of ETL workflow workloads in order to evaluate the ETL systems they develop. Populating ETL test systems with meaningful workload data is a difficult task. In this paper, we propose a semantic‐aware data generator for ETL workflows. Given ETL workflow models and workload characterizations, the generator produces synthetic data that capture the semantics of the ETL activities. This is carried out in a three‐stage approach. First, we derive the expected cardinalities of all source, intermediate, and target data sets involved in the ETL workflow model from a set of user‐specified cardinality requirements. Then, following the concept of symbolic testing, we generate symbolic data instead of concrete data for the ETL activities, and the semantics of the ETL workflow model are transformed into constraints over these symbols. Finally, concrete data are derived by solving these constraints. Our generator may facilitate ETL workload test case generation for the performance and functional evaluation of ETL toolkits as well as for the benchmarking of ETL workflow solutions. Copyright © 2013 John Wiley & Sons, Ltd.
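
To illustrate the three stages named in the abstract, the following is a minimal Python sketch for a toy workflow consisting of a single filter activity. It is not the generator proposed in the paper. The `Filter` class, its `selectivity` parameter, the function names, and the simple interval‐based constraint resolution are all hypothetical and introduced here only for illustration. A realistic generator would handle many activity types and would probably rely on a general constraint solver.

```python
import random
from dataclasses import dataclass


@dataclass
class Filter:
    """Hypothetical ETL activity: keep rows whose value exceeds `threshold`."""
    column: str
    threshold: int
    selectivity: float  # assumed fraction of input rows that pass, in (0, 1]


def derive_source_cardinality(target_rows: int, f: Filter) -> int:
    # Stage 1: propagate a user-specified target cardinality back to the source.
    return round(target_rows / f.selectivity)


def generate_symbolic_rows(source_rows: int, target_rows: int, f: Filter):
    # Stage 2: each row is a symbol plus a constraint instead of a concrete value.
    # The first `target_rows` symbols must pass the filter; the rest must fail it.
    rows = []
    for i in range(source_rows):
        op = ">" if i < target_rows else "<="
        rows.append({"symbol": f"{f.column}_{i}", "op": op, "bound": f.threshold})
    return rows


def resolve(rows, rng: random.Random, spread: int = 1000):
    # Stage 3: pick a concrete integer that satisfies each symbol's constraint.
    values = []
    for r in rows:
        if r["op"] == ">":
            values.append(rng.randint(r["bound"] + 1, r["bound"] + spread))
        else:
            values.append(rng.randint(r["bound"] - spread, r["bound"]))
    return values


def generate(target_rows: int, f: Filter, seed: int = 0):
    rng = random.Random(seed)
    source_rows = derive_source_cardinality(target_rows, f)
    symbolic = generate_symbolic_rows(source_rows, target_rows, f)
    values = resolve(symbolic, rng)
    rng.shuffle(values)  # avoid placing all passing rows first
    return values


if __name__ == "__main__":
    amount_filter = Filter(column="amount", threshold=100, selectivity=0.25)
    data = generate(target_rows=50, f=amount_filter)
    passing = sum(v > amount_filter.threshold for v in data)
    print(len(data), passing)
```

In this sketch, requesting 50 target rows with an assumed selectivity of 0.25 yields 200 source values, of which exactly 50 satisfy the filter predicate, so the generated source data reproduce the requested cardinalities when the filter is applied.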