Human-in-the-loop Rule Learning for Data Integration

Rule-based data integration approaches are widely adopted due to their better interpretability and effective interactive debugging. However, generating high-quality rules for data integration tasks is very challenging. Hand-crafted rules from domain experts are usually reliable, but they do not scale: handcrafting enough rules to cover a large share of the data takes considerable time and effort. On the other hand, weak-supervision rules generated automatically by machines, such as distant supervision rules, can cover most of the items; however, they may be so noisy that they produce many wrong results. To address this problem, we propose a human-in-the-loop rule learning approach that achieves both high coverage and high quality. The approach first generates a set of candidate rules and uses a machine-based method to learn a confidence for each rule with generative adversarial networks. It then devises a game-based crowdsourcing framework to refine the rules, and develops a budget-constrained crowdsourcing algorithm for rule refinement at affordable cost. Finally, it applies the refined rules to produce high-quality data integration results.
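The pipeline outlined above (candidate rules with a confidence each, crowd refinement under a budget, then rule application) can be illustrated with a small sketch. This is not the paper's algorithm: the fixed `confidence` values stand in for scores that the paper learns with generative adversarial networks, the `ask_crowd` callback stands in for the game-based crowdsourcing framework, and the "ask about the most uncertain rules first" heuristic, along with every name below (`Rule`, `refine_rules`, `apply_rules`), is an assumption made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # a candidate pair of records, simplified to two strings


@dataclass
class Rule:
    name: str
    predicate: Callable[[Pair], bool]  # True means "these two records match"
    confidence: float  # placeholder for a machine-learned score in [0, 1]


def refine_rules(rules: List[Rule],
                 ask_crowd: Callable[[Rule], bool],
                 budget: int,
                 threshold: float = 0.5) -> List[Rule]:
    """Spend at most `budget` crowd questions on the least certain rules."""
    # Assumed heuristic: rules whose confidence is closest to the threshold
    # are the ones where a human answer is most informative.
    by_uncertainty = sorted(rules, key=lambda r: abs(r.confidence - threshold))
    answers: Dict[str, bool] = {r.name: ask_crowd(r) for r in by_uncertainty[:budget]}

    kept = []
    for rule in rules:
        if rule.name in answers:
            # A crowd verdict overrides the machine confidence.
            if answers[rule.name]:
                kept.append(rule)
        elif rule.confidence >= threshold:
            # Unasked rules fall back to the machine confidence.
            kept.append(rule)
    return kept


def apply_rules(rules: List[Rule], pairs: List[Pair]) -> List[Pair]:
    """Return the pairs that at least one retained rule marks as a match."""
    return [p for p in pairs if any(r.predicate(p) for r in rules)]


# Hypothetical candidate rules for matching product names.
rules = [
    Rule("exact_match", lambda p: p[0] == p[1], 0.95),
    Rule("same_first_token", lambda p: p[0].split()[0] == p[1].split()[0], 0.55),
    Rule("same_length", lambda p: len(p[0]) == len(p[1]), 0.45),
]

# Hypothetical crowd: rejects the length-based rule, accepts the others.
def ask_crowd(rule: Rule) -> bool:
    return rule.name != "same_length"

pairs = [
    ("iphone 8 black", "iphone 8 black"),
    ("iphone 8 black", "iphone x"),
    ("galaxy s9", "iphone x"),
]

kept = refine_rules(rules, ask_crowd, budget=2)
matches = apply_rules(kept, pairs)
print([r.name for r in kept])
print(matches)
```

With a budget of two questions, the sketch asks the crowd only about the two rules near the threshold and trusts the machine score for the high-confidence rule; a real system would also need to account for noisy crowd answers and per-question cost, which are omitted here.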
