Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation

Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real-world entity. This is critical in high-impact applications such as data cleaning and user stitching. State-of-the-art entity linkage pipelines mainly depend on supervised learning, which requires abundant training data. However, collecting well-labeled training data becomes expensive when data from many sources arrives incrementally over time. Moreover, the trained models can easily overfit to specific data sources and thus fail to generalize to new sources due to significant differences in data and label distributions. To address these challenges, we present AdaMEL, a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage. AdaMEL models the attribute importance used to match entities through an attribute-level self-attention mechanism, and leverages massive unlabeled data from new data sources through domain adaptation to make the learned knowledge generic and data-source-agnostic. In addition, AdaMEL can incorporate an additional set of labeled data to more accurately integrate data sources with different attribute importance. Extensive experiments show that our framework achieves state-of-the-art results with an average improvement of 8.21% over methods based on supervised learning, and that it handles different sets of data sources more stably and in less runtime.

PVLDB Reference Format: Di Jin, Bunyamin Sisman, Hao Wei, Xin Luna Dong, and Danai Koutra. Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation. PVLDB, 14(1): XXX-XXX, 2020.
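To make the two ingredients named in the abstract concrete, below is a minimal, illustrative sketch (not the authors' implementation): attribute-level self-attention over per-attribute similarity features, paired with a gradient-reversal domain classifier in the style of domain-adversarial training for unsupervised domain adaptation. The module name, tensor shapes, and hyperparameters are assumptions made for illustration only.

# Illustrative sketch only; AdaMEL's actual architecture and losses are defined in the paper.
import torch
import torch.nn as nn
from torch.autograd import Function


class GradReverse(Function):
    """Identity on the forward pass; reverses (and scales) gradients on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AttributeAttentionMatcher(nn.Module):
    """Hypothetical matcher: learns attribute importance via self-attention and a
    domain-invariant representation via an adversarial domain classifier."""
    def __init__(self, num_attrs, feat_dim, hidden=64, lambd=0.1):
        super().__init__()
        self.lambd = lambd
        # Self-attention across the attribute axis: the attention weights act as
        # learned attribute importance for matching.
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=1, batch_first=True)
        self.match_head = nn.Sequential(   # predicts match / non-match on labeled source pairs
            nn.Linear(num_attrs * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.domain_head = nn.Sequential(  # predicts source vs. target domain (unlabeled data)
            nn.Linear(num_attrs * feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, x):
        # x: (batch, num_attrs, feat_dim) similarity features, one row per attribute pair
        h, attn_weights = self.attn(x, x, x)          # attribute-level self-attention
        h = h.flatten(1)                              # (batch, num_attrs * feat_dim)
        match_logits = self.match_head(h)
        domain_logits = self.domain_head(GradReverse.apply(h, self.lambd))
        return match_logits, domain_logits, attn_weights

As a usage sketch, for candidate record pairs described by, say, 8 attribute pairs with 16-dimensional similarity features each, AttributeAttentionMatcher(num_attrs=8, feat_dim=16)(torch.randn(32, 8, 16)) returns match logits, domain logits, and attention weights that can be inspected as learned attribute importance; the gradient-reversal term encourages the shared representation to be data-source-agnostic.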
