OPEN USER INVOLVEMENT IN DATA CLEANING FOR DATA WAREHOUSE QUALITY

A high-quality data warehouse is key to making sound strategic decisions. Data cleaning is the process that deals with quality problems in data extracted from operational sources before the data are loaded into the data warehouse. Because data cleaning can itself introduce errors, and some data must be cleaned manually, there is a need for open user involvement in data cleaning for data warehouse quality. User involvement is essential to validate the cleaned data, to replace the dirty data in their original sources, and to correct poor data that cannot be cleaned automatically. In this paper, we extend the data cleaning and extract-transform-load (ETL) processes to better support user involvement in data quality management. We propose that the ETL processes include two phases: a transformation phase that cleans data at the operational data sources, and a propagation phase that sends the cleaned data back to their original sources. The benefits of our proposal are twofold. First, users validate the cleaned data. Second, the quality of the operational data sources improves. Consequently, data cleaning based on user involvement leads to total data quality management and avoids repeating the same cleaning in future warehousing.
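The two phases described above can be illustrated with a minimal Python sketch. It is not the implementation proposed in the paper: the names `Correction`, `clean_value`, `transform`, and `propagate`, the dictionary standing in for an operational source, and the trim-and-title-case rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Correction:
    """A proposed change to one source value (hypothetical structure)."""
    key: int
    old: str
    new: Optional[str]  # None means no automatic fix was found


def clean_value(value: str) -> Optional[str]:
    """Hypothetical cleaning rule: trim and title-case; give up on blank values."""
    stripped = value.strip()
    if not stripped:
        return None
    return stripped.title()


def transform(source: Dict[int, str]) -> List[Correction]:
    """Phase 1: propose a correction for every value the rule would change."""
    proposals = []
    for key, value in source.items():
        cleaned = clean_value(value)
        if cleaned != value:
            proposals.append(Correction(key, value, cleaned))
    return proposals


def propagate(
    source: Dict[int, str],
    proposals: List[Correction],
    validate: Callable[[Correction], bool],
) -> List[Correction]:
    """Phase 2: write user-validated corrections back to the source.

    Returns the corrections that still need manual work: values with no
    automatic fix, and fixes the user rejected.
    """
    pending = []
    for correction in proposals:
        if correction.new is not None and validate(correction):
            source[correction.key] = correction.new
        else:
            pending.append(correction)
    return pending


if __name__ == "__main__":
    # Assumed toy source table: record id -> customer name.
    source = {1: "  alice smith", 2: "Bob Jones", 3: "   ", 4: "carol"}
    proposals = transform(source)
    # The lambda stands in for a user who rejects the fix for record 4.
    pending = propagate(source, proposals, validate=lambda c: c.key != 4)
    print(source)
    print([c.key for c in pending])
```

In this sketch the `validate` callback plays the role of the user: accepted corrections are written back to the source, so the same cleaning need not be repeated at the next load, while rejected or unfixable values are returned for manual correction.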
