AN APPROACH FOR TESTING THE EXTRACT-TRANSFORM-LOAD PROCESS IN DATA WAREHOUSE SYSTEMS Submitted

The Extract-Transform-Load (ETL) process in data warehousing involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex transformations involving sources and targets that use different schemas, databases, and technologies, which make ETL implementations fault-prone. In this paper, we present an approach for validating ETL processes using automated balancing tests that check for various types of discrepancies between the source and target data. We formalize three categories of properties, namely, completeness, consistency, and syntactic validity that must be checked during testing. Our approach uses the rules provided in the ETL specifications to generate source-to-target mappings, from which balancing test assertions are generated for each property. We evaluated the approach on a real-world health data warehouse project and revealed 11 previously undetected faults. Using mutation analysis, we demonstrated that our auto-generated assertions can detect faults in the data inside the target data warehouse.

[1]  Mark Harman,et al.  An Analysis and Survey of the Development of Mutation Testing , 2011, IEEE Transactions on Software Engineering.

[2]  Mark A. Weiss Data Structures & Algorithm Analysis in C++ , 2012 .

[3]  Elaine J. Weyuker,et al.  An AGENDA for testing relational database applications , 2004, Softw. Test. Verification Reliab..

[4]  Ryan Wisnesky,et al.  Orchid: Integrating Schema Mapping and ETL , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[5]  Donald R. Slutz,et al.  Massive Stochastic Testing of SQL , 1998, VLDB.

[6]  Mark Harman,et al.  The Oracle Problem in Software Testing: A Survey , 2015, IEEE Transactions on Software Engineering.

[7]  David Loshin,et al.  Rule-based data quality , 2002, CIKM '02.

[8]  José Barateiro,et al.  A Survey of Data Quality Tools , 2005, Datenbank-Spektrum.

[9]  Alberto L. Sangiovanni-Vincentelli,et al.  A methodology for correct-by-construction latency insensitive design , 1999, 1999 IEEE/ACM International Conference on Computer-Aided Design. Digest of Technical Papers (Cat. No.99CH37051).

[10]  Xin Bai Testing the Performance of an SSAS Cube Using VSTS , 2010, 2010 Seventh International Conference on Information Technology: New Generations.

[11]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[12]  Ronald Fagin,et al.  Data exchange: semantics and query answering , 2003, Theor. Comput. Sci..

[13]  Larissa Terpeluk Moss,et al.  Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications , 2003 .

[14]  Abdelmgeid A. Ali,et al.  Automated ETL Testing on the Data Quality of a Data Warehouse , 2015 .

[15]  Pedro Nuno San-Banto Furtado Evolving Application Domains of Data Warehousing and Mining: Trends and Solutions , 2009 .

[16]  George Papastefanatos,et al.  Design Metrics for Data Warehouse Evolution , 2008, ER.

[17]  Hyoil Han,et al.  XML-OLAP: A Multidimensional Analysis Framework for XML Warehouses , 2005, DaWaK.

[18]  Steven G. Johnson,et al.  A Harmonized Data Quality Assessment Terminology and Framework for the Secondary Use of Electronic Health Record Data , 2016, EGEMS.

[19]  Vincent Rainardi,et al.  Building a Data Warehouse: With Examples in SQL Server , 2008 .

[20]  Gregg Rothermel,et al.  An empirical study of regression test selection techniques , 2001, ACM Trans. Softw. Eng. Methodol..

[21]  Mladen Varga On the Differences of Relational and Dimensional Data Model , 2001 .

[22]  Qing Li,et al.  Star/Snow-Flake Schema Driven Object-Relational Data Warehouse - Design and Query Processing Strategies , 1999, DaWaK.

[23]  M. Pamela Neely,et al.  Data Quality Tools for Data Warehousing: A Small Sample Survey , 1998, IQ.

[24]  Christer Carlsson,et al.  Past, present, and future of decision support technology , 2002, Decis. Support Syst..

[25]  Michael J. A. Berry,et al.  Data mining techniques - for marketing, sales, and customer support , 1997, Wiley computer publishing.

[26]  Matteo Golfarelli,et al.  A comprehensive approach to data warehouse testing , 2009, DOLAP.

[27]  Lionel C. Briand,et al.  Automating impact analysis and regression test selection based on UML designs , 2002, International Conference on Software Maintenance, 2002. Proceedings..

[28]  Galal H. Galal-Edeen,et al.  Data warehouse testing , 2013, EDBT '13.

[29]  Ralph Kimball,et al.  The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming, and Delivering Data , 2004 .

[30]  Jerry Zeyu Gao,et al.  Big Data Validation and Quality Assurance -- Issuses, Challenges, and Needs , 2016, 2016 IEEE Symposium on Service-Oriented System Engineering (SOSE).

[31]  Andrea C. Arpaci-Dusseau,et al.  FATE and DESTINI: A Framework for Cloud Recovery Testing , 2011, NSDI.

[32]  Guan Le,et al.  Survey on NoSQL database , 2011, 2011 6th International Conference on Pervasive Computing and Applications.

[33]  Marius Marin,et al.  A Data-Agnostic Approach to Automatic Testing of Multi-dimensional Databases , 2014, 2014 IEEE Seventh International Conference on Software Testing, Verification and Validation.

[34]  JaatunMartin Gilje,et al.  Agile Software Development , 2002, Comput. Sci. Educ..

[35]  Andreas Reuter,et al.  Principles of transaction-oriented database recovery , 1983, CSUR.

[36]  N. ElGamal,et al.  Towards a data warehouse testing framework , 2012, 2011 Ninth International Conference on ICT and Knowledge Engineering.

[37]  Lawrence Corr,et al.  Agile Data Warehouse Design: Collaborative Dimensional Modeling, from Whiteboard to Star Schema , 2011 .

[38]  Timos K. Sellis,et al.  Ontology-Driven Conceptual Design of ETL Processes Using Graph Transformations , 2009, J. Data Semant..

[39]  Matteo Golfarelli,et al.  Data Warehouse Testing , 2011, Int. J. Data Warehous. Min..

[40]  Christopher Hass,et al.  A Practical Guide to Usability Testing , 2019, Consumer Informatics and Digital Health.

[41]  Ilene Burnstein,et al.  Practical Software Testing: A Process-Oriented Approach , 2003 .

[42]  E. Berner,et al.  Clinical Decision Support Systems: Theory and Practice , 1998 .

[43]  Peter Gluchowski,et al.  Data Warehouse , 1997, Informatik-Spektrum.

[44]  Mario Piattini,et al.  Towards Data Warehouse Quality Metrics , 2001, DMDW.

[45]  A Depeursinge,et al.  Clinical Data Mining: a Review , 2009, Yearbook of Medical Informatics.

[46]  Barbara J. Grosz,et al.  Natural-Language Processing , 1982, Artificial Intelligence.

[47]  P. Ryan,et al.  Transforming the Premier Perspective® Hospital Database into the Observational Medical Outcomes Partnership (OMOP) Common Data Model , 2014, EGEMS.

[48]  Hongjun Lu,et al.  Cleansing Data for Mining and Warehousing , 1999, DEXA.

[49]  Randy H. Katz,et al.  A view of cloud computing , 2010, CACM.

[50]  OBAS: An OLAP Benchmark for Analysis Services , 2013, J. Inf. Data Manag..

[51]  Daniel Pol,et al.  Principles for an ETL Benchmark , 2009, TPCTC.

[52]  George Colliat,et al.  OLAP, relational, and multidimensional database systems , 1996, SGMD.

[53]  Matteo Golfarelli,et al.  Data Warehouse Design: Modern Principles and Methodologies , 2009 .

[54]  Phil McMinn,et al.  Search-Based Testing of Relational Schema Integrity Constraints Across Multiple Database Management Systems , 2013, 2013 IEEE Sixth International Conference on Software Testing, Verification and Validation.