Improving the Efficiency and Effectiveness for BERT-based Entity Resolution

BERT has set a new state of the art on the entity resolution (ER) task, largely owing to the fine-tuning of pretrained language models and deep pair-wise interaction. Although remarkably effective, it comes with a steep increase in computational cost, because deep interaction requires exhaustively scoring every tuple pair to search for coreferences. For ER tasks, this is often prohibitively expensive due to the large number of tuples to be matched. To tackle this, we introduce a siamese network structure that encodes tuples independently using BERT but delays the pair-wise interaction, performing it in an enhanced alignment network. This siamese structure enables a dedicated blocking module to quickly filter out obviously dissimilar tuple pairs, and thus drastically reduces the number of pairs that require fine-grained matching. Further, blocking and entity matching are integrated into a multi-task learning framework to facilitate both tasks. Extensive experiments on multiple datasets demonstrate that our model significantly outperforms state-of-the-art models (including BERT) in both efficiency and effectiveness.
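
The idea of blocking over independently encoded tuples can be illustrated with a minimal sketch. It is not the model described above: the toy vectors stand in for BERT tuple encodings, and a plain cosine-similarity threshold is assumed in place of the learned blocking module. The function name `block_candidates` and the `threshold` parameter are hypothetical.

```python
import numpy as np


def block_candidates(left_emb, right_emb, threshold=0.8):
    """Return (i, j) index pairs whose cosine similarity is >= threshold.

    Assumption: each row is an embedding of one tuple, computed
    independently of the other table (the siamese setting).
    """
    # Normalize rows so that a dot product equals cosine similarity.
    left = left_emb / np.linalg.norm(left_emb, axis=1, keepdims=True)
    right = right_emb / np.linalg.norm(right_emb, axis=1, keepdims=True)
    sims = left @ right.T  # shape: (num_left, num_right)
    rows, cols = np.nonzero(sims >= threshold)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy embeddings standing in for encoded tuples (illustrative values only).
left = np.array([[1.0, 0.0], [0.0, 1.0]])
right = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]])

candidates = block_candidates(left, right, threshold=0.8)
print(candidates)  # [(0, 0), (1, 1)]
```

In this toy case only 2 of the 6 possible pairs survive blocking, so a more expensive pair-wise matcher would need to score just those two.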
