Complementing Data in the ETL Process

Data quality is critical in a typical Data Warehouse (DW) environment. The process of transferring data from different sources into the DW environment, known as ETL (Extraction, Transformation, and Load), is usually responsible for improving data quality. However, it is not unusual to find null values in a DW fact table during the ETL process, and these may reduce the accuracy of data analysis results. Data imputation techniques are commonly used to deal with the missing value problem. Some of them inspect the other values in the table to generate a replacement for the missing one. This paper proposes a new strategy to address the missing data problem in the ETL process. The idea is to enrich the DW fact table with dimension attributes in order to obtain better imputation results. The strategy uses the k-NN algorithm as the imputation approach. Tests performed on an implemented prototype showed promising results with respect to imputation quality.
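The enrichment idea can be illustrated with a minimal sketch. It is not the paper's prototype: the table layout, the names `STORE_DIM`, `FACTS`, `enrich`, and `knn_impute`, the Euclidean distance, the mean aggregation, and all values are assumptions chosen only for illustration. Each fact row is extended with the attributes of its dimension record, and a missing measure is filled from the k nearest complete rows in that enriched space.

```python
from math import dist
from statistics import mean

# Hypothetical dimension table: store_id -> (store size, region code).
# Both attributes are assumed to be numeric and already scaled.
STORE_DIM = {
    1: (1.0, 0.0),
    2: (1.2, 0.0),
    3: (5.0, 1.0),
    4: (5.4, 1.0),
    5: (5.2, 1.0),
}

# Hypothetical fact rows: (store_id, units_sold, revenue).
# None marks a missing measure that needs imputation.
FACTS = [
    (1, 10.0, 100.0),
    (2, 10.0, 120.0),
    (3, 10.0, 500.0),
    (4, 10.0, 520.0),
    (5, 10.0, None),
]


def enrich(fact):
    """Return the feature vector of a fact row, extended with dimension attributes."""
    store_id, units_sold, _ = fact
    return (units_sold, *STORE_DIM[store_id])


def knn_impute(facts, k=2):
    """Fill missing revenue with the mean revenue of the k nearest complete rows."""
    complete = [f for f in facts if f[2] is not None]
    result = []
    for f in facts:
        if f[2] is None:
            # Rank complete rows by Euclidean distance in the enriched space.
            neighbours = sorted(
                complete, key=lambda c: dist(enrich(f), enrich(c))
            )[:k]
            f = (f[0], f[1], mean(n[2] for n in neighbours))
        result.append(f)
    return result


imputed = knn_impute(FACTS, k=2)
```

In this toy data every row has the same `units_sold`, so the fact table alone cannot tell the rows apart. Once the dimension attributes are added, the two nearest neighbours of store 5 are stores 3 and 4, and the missing revenue is imputed as their mean, 510.0.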
