Data quality in ETL process: A preliminary study

Abstract The accuracy and relevance of Business Intelligence & Analytics (BI&A) depend on the ability to load high-quality data into the data warehouse from both internal and external sources through the ETL process. This process is complex and time-consuming because it handles data with heterogeneous content and diverse quality problems. Ensuring data quality requires tracking quality defects throughout the ETL process. In this paper, we present the main ETL quality characteristics and give an overview of existing approaches to data quality in the ETL process. We also compare several commercial ETL tools to show the extent to which they address data quality dimensions. To illustrate our study, we carry out experiments using a dedicated ETL solution (Talend Data Integration) and a dedicated data quality solution (Talend Data Quality). Based on our study, we identify and discuss quality challenges to be addressed in our future research.
