Data Quality Problems in ETL: The State of the Practice in Large Organisations

This paper reviews the data quality problems that Extract, Transform and Load (ETL) technology introduces in large organisations, taking into account the context in which the ETL is deployed. Using a case study methodology, it reports the data quality problems, and the context in which they arose, from deployments in six large organisations. The findings indicate that ETL deployments most commonly introduce data accessibility problems, which are caused by (1) the ETL failing partway through and not delivering the data on time, (2) the information systems being locked during ETL execution, and (3) users being unable to find data in the target because of errors in the way the primary keys are transformed. Accuracy, timeliness, believability, and representational consistency problems were also found to be caused by the ETL technology.
