Low-resource Deep Entity Resolution with Transfer and Active Learning

Entity resolution (ER) is the task of identifying different representations of the same real-world entity across databases. It is a key step in knowledge base creation and text mining. Recent adaptations of deep learning methods to ER reduce the need for dataset-specific feature engineering by constructing distributed representations of entity records. While these methods achieve state-of-the-art performance on benchmark data, they require large amounts of labeled data, which are typically unavailable in realistic ER applications. In this paper, we develop a deep learning-based method that targets low-resource settings for ER through a novel combination of transfer learning and active learning. We design an architecture that allows us to learn a model in a high-resource setting and transfer it to a low-resource one. To further adapt to the target dataset, we incorporate active learning, which selects a few informative examples with which to fine-tune the transferred model. Empirical evaluation demonstrates that our method achieves performance comparable to, if not better than, that of state-of-the-art learning-based methods while using an order of magnitude fewer labels.
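
The abstract does not say how informative examples are chosen. As a minimal sketch of the general idea, the snippet below uses entropy-based uncertainty sampling, a common active learning heuristic that is assumed here for illustration and is not necessarily the selection rule used in the paper. The names `binary_entropy` and `select_for_labeling`, and the example probabilities, are hypothetical; the probabilities stand in for the match scores a transferred model might assign to candidate record pairs in the target dataset.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a predicted match probability p in [0, 1]."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_for_labeling(match_probs, k):
    """Return the indices of the k candidate pairs with the most uncertain predictions."""
    ranked = sorted(
        range(len(match_probs)),
        key=lambda i: binary_entropy(match_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical match probabilities for five unlabeled record pairs.
probs = [0.97, 0.52, 0.08, 0.40, 0.99]
picked = select_for_labeling(probs, 2)
print(picked)
```

Under this heuristic, the pairs scored closest to 0.5 would be sent to a human annotator, and their labels would then be used to fine-tune the transferred model.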
