This paper proposes a method for repairing dirty data in databases. Because the data to be cleaned can be structurally complex, the method builds on previous work and combines two families of techniques, repair based on regular expressions and repair based on conditional functional dependencies (CFDs), to improve data quality. The dependencies are used to reduce both the time spent searching the data and the time spent repairing it. Repair based on regular expressions is systematic, but its efficiency degrades as the amount of data grows. The method was motivated by work on a real-world database from the company Standard Solution Group (SSG): we first applied other related methods to this data, and the proposed method was inspired by them. Experiments on the SSG data show that the method is highly efficient.
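To make the combination concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a constant CFD of the form "if `country_code` has a given value, then `country` must have a given value" and a format rule that a `zip` field must consist of five digits. The attribute names, the rules, and the order in which the two repairs are applied are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical constant CFDs: when every condition attribute matches,
# the target attribute must hold the given value.
CFD_RULES = [
    ({"country_code": "44"}, ("country", "UK")),
    ({"country_code": "01"}, ("country", "US")),
]

# Hypothetical format rule: a ZIP code is exactly five digits.
ZIP_PATTERN = re.compile(r"\d{5}")


def violated_cfds(row):
    """Return the (attribute, expected value) pairs that the row violates."""
    return [
        target
        for condition, target in CFD_RULES
        if all(row.get(k) == v for k, v in condition.items())
        and row.get(target[0]) != target[1]
    ]


def repair_row(row):
    """Return a repaired copy: CFD fixes first, then regex normalization."""
    fixed = dict(row)
    for attribute, expected in violated_cfds(fixed):
        fixed[attribute] = expected
    zip_code = fixed.get("zip", "")
    if not ZIP_PATTERN.fullmatch(zip_code):
        digits = re.sub(r"\D", "", zip_code)
        # Accept the repair only if stripping non-digits yields a valid value.
        if ZIP_PATTERN.fullmatch(digits):
            fixed["zip"] = digits
    return fixed


dirty = {"country_code": "44", "country": "U.K.", "zip": "12-345"}
print(repair_row(dirty))
```

In this sketch the dependency check is a cheap dictionary lookup per tuple, while the regular expression is applied only to the field whose format is invalid. This is one possible way to let dependencies narrow the work that the regex step has to do.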