Paper information - A rule based taxonomy of dirty data.

There is growing awareness that high-quality data is key to today’s business success, and that dirty data within data sources is one of the causes of poor data quality. To ensure high-quality data, enterprises need processes, methodologies, and resources to monitor, analyze, and maintain data quality. Nevertheless, research shows that many enterprises pay inadequate attention to dirty data and have not applied useful methodologies to ensure high-quality data for their applications. One reason is a lack of appreciation of the types and extent of dirty data. In practice, detecting and cleaning all the dirty data in every data source is expensive and unrealistic, so most enterprises must weigh the cost of cleaning. This problem has not attracted enough attention from researchers. In this paper, a rule-based taxonomy of dirty data is developed. The proposed taxonomy not only provides a mechanism for dealing with this problem but also covers more dirty data types than any existing taxonomy.
