Bayesian Data Cleaning for Web Data

Data Cleaning is a long standing problem, which is growing in importance with the mass of uncurated web data. State of the art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns--CFDs--from a clean sample of the data and use them to rectify the dirty/inconsistent data. While getting a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases where there is no separate curated data. CFD based methods are unfortunately particularly sensitive to noise; we will empirically demonstrate that the number of CFDs learned falls quite drastically with even a small amount of noise. In order to overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayes networks from the data. For error models, we consider a maximum entropy framework for combing multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.

[1]  Alexes Butler,et al.  Microsoft Research Cambridge , 2013 .

[2]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[3]  Thomas Redman,et al.  The impact of poor data quality on the typical enterprise , 1998, CACM.

[4]  Peter N. Yianilos,et al.  Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Andrew W. Moore,et al.  Probabilistic noise identification and data cleaning , 2003, Third IEEE International Conference on Data Mining.

[6]  Renée J. Miller,et al.  Discovering data quality rules , 2008, Proc. VLDB Endow..

[7]  References , 1971 .

[8]  Jianzhong Li,et al.  Towards certain fixes with editing rules and master data , 2010, Proc. VLDB Endow..

[9]  Rajeev Rastogi,et al.  A cost-based model and effective heuristic for repairing constraints by value modification , 2005, SIGMOD '05.

[10]  Chian-Huei Wun,et al.  Using association rules for completing missing data , 2004, Fourth International Conference on Hybrid Intelligent Systems (HIS'04).

[11]  Hui Xiong,et al.  Enhancing data analysis with noise removal , 2006, IEEE Transactions on Knowledge and Data Engineering.

[12]  Laks V. S. Lakshmanan,et al.  Data Cleaning and Query Answering with Matching Dependencies and Matching Functions , 2010, ICDT '11.

[13]  D. Holt,et al.  A Systematic Approach to Automatic Edit and Imputation , 1976 .

[14]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.