Analysis of Data Quality Problem Taxonomies

There are many reasons to maintain high-quality data in databases and other structured data sources. High-quality data supports better discovery, automated data analysis, data mining, migration and re-use. However, human errors or faults in the data systems themselves can corrupt data. This paper analyses existing data quality problem taxonomies for structured textual data, together with several improvements to them. It then proposes a new classification of data quality problems and a framework for detecting data errors both with and without data operator assistance.
