Automatic weighted matching rectifying rule discovery for data repairing

Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods rely on data dependencies to detect errors, but they cannot rectify them. To overcome this limitation, recent methods define repairing rules that both detect and fix errors. However, all existing data repairing rules are provided by experts, which is costly in time and effort. Moreover, rule-based data repairing methods need an external verified data source or user verification; without them, they are incomplete and can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs), which use similarity matching to capture more errors. We propose a novel algorithm that discovers WMRRs automatically from the dirty data at hand. We also develop an automatic algorithm for resolving inconsistencies among rules. Additionally, based on WMRRs, we propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We experimentally evaluate our method on both real-life and synthetic data. The experimental results show that our method can discover effective WMRRs from the dirty data at hand and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than existing dependable methods.
