Generating data quality rules and integration into ETL process

Data quality work is often folded into a data warehouse project without enough time allocated to it. This creates a need for a data quality process that is quick to implement and can be adopted as the first stage of the data warehouse implementation. We will see that many data quality rules can be implemented in a similar way and can therefore be generated from metadata tables that store information about the rules. The generated rules are then used to check data in designated tables and mark erroneous records, or to apply specific updates to invalid data. We will also store information about rule violations so that this data can be analyzed, which can give significant insight into the source systems. The entire data quality process will be integrated into the ETL process so that the data warehouse load is as automated, correct, and quick as possible. Only a small number of records should be left for manual inspection and reprocessing.
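The idea of generating rules from metadata can be illustrated with a minimal sketch. It is not the implementation described in the paper: the table names (dq_rule, dq_violation, stg_customer), the dq_error flag column, the two rule types and the use of an in-memory SQLite database are all assumptions chosen for illustration. Each row of the rule metadata table is expanded into SQL that logs the violating records and marks them in the staging table.

```python
import sqlite3

# Hypothetical rule templates: each rule type maps to a SQL predicate
# that is TRUE for records violating the rule.
TEMPLATES = {
    "not_null": "{column} IS NULL",
    "range": "{column} NOT BETWEEN {param1} AND {param2}",
}

def violation_predicate(column, rule_type, param1, param2):
    """Expand one metadata row into the predicate selecting bad records."""
    return TEMPLATES[rule_type].format(column=column, param1=param1, param2=param2)

def run_rules(conn):
    """Generate and run SQL for every rule stored in the metadata table."""
    rules = conn.execute(
        "SELECT rule_id, table_name, column_name, rule_type, param1, param2 "
        "FROM dq_rule ORDER BY rule_id"
    ).fetchall()
    for rule_id, table, column, rule_type, p1, p2 in rules:
        # Identifiers come from the rule metadata, which is assumed to be trusted.
        pred = violation_predicate(column, rule_type, p1, p2)
        # Store information about each violation for later analysis.
        conn.execute(
            f"INSERT INTO dq_violation (rule_id, table_name, record_id) "
            f"SELECT {rule_id}, '{table}', id FROM {table} WHERE {pred}"
        )
        # Mark the erroneous records in the checked table.
        conn.execute(f"UPDATE {table} SET dq_error = 1 WHERE {pred}")

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dq_rule (rule_id INTEGER, table_name TEXT, column_name TEXT,
                          rule_type TEXT, param1 NUMERIC, param2 NUMERIC);
    CREATE TABLE dq_violation (rule_id INTEGER, table_name TEXT, record_id INTEGER);
    CREATE TABLE stg_customer (id INTEGER, name TEXT, age INTEGER,
                               dq_error INTEGER DEFAULT 0);
""")
conn.executemany(
    "INSERT INTO dq_rule VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "stg_customer", "name", "not_null", None, None),
     (2, "stg_customer", "age", "range", 0, 120)],
)
conn.executemany(
    "INSERT INTO stg_customer (id, name, age) VALUES (?, ?, ?)",
    [(1, "Ana", 34), (2, None, 41), (3, "Ivo", -5), (4, "Eva", 130)],
)

run_rules(conn)

violations = conn.execute(
    "SELECT rule_id, record_id FROM dq_violation ORDER BY rule_id, record_id"
).fetchall()
flagged = [row[0] for row in conn.execute(
    "SELECT id FROM stg_customer WHERE dq_error = 1 ORDER BY id")]
print(violations)
print(flagged)
```

In this sketch, adding a new check means inserting a row into dq_rule rather than writing new code, and the dq_violation table can be grouped by rule or source table to see where the source systems produce the most errors.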
