SeMBlock: A semantic-aware meta-blocking approach for entity resolution

Entity resolution is the process of identifying, matching, and integrating records that belong to the same entity in a data set. Comparing every pair of records, however, has quadratic complexity. Blocking methods are therefore used to group similar entities into small blocks before matching, so that only records within the same block are compared. Available blocking methods typically do not consider semantic relationships among records. In this paper, we propose a Semantic-aware Meta-Blocking approach called SeMBlock. SeMBlock captures the semantic similarity of records by applying locality-sensitive hashing (LSH) to word embeddings, which yields fast and reliable blocking in a large-scale data environment. To improve the quality of the resulting blocks, SeMBlock builds a weighted graph of semantically similar records and prunes its edges. We extensively compare SeMBlock with 16 existing blocking methods on three real-world data sets. The experimental results show that SeMBlock significantly outperforms all 16 methods with respect to two relevant measures, F-measure and pair quality. The F-measure and pair quality of SeMBlock are approximately 7% and 27% higher, respectively, than those of recently released blocking methods.
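The pipeline outlined in the abstract (embed records, hash them with LSH, build a weighted graph, prune edges) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the four-dimensional toy word vectors, the random-hyperplane variant of LSH, cosine similarity as the edge weight, the mean-weight pruning threshold, and the function names `embed`, `lsh_candidates`, and `prune` are all hypothetical choices.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
TOY_VECTORS = {
    "apple":   [0.9, 0.1, 0.0, 0.2],
    "iphone":  [0.8, 0.2, 0.1, 0.1],
    "phone":   [0.7, 0.3, 0.1, 0.0],
    "samsung": [0.6, 0.1, 0.5, 0.1],
    "galaxy":  [0.5, 0.2, 0.6, 0.0],
    "banana":  [-0.2, 0.9, -0.1, 0.3],
    "fruit":   [-0.3, 0.8, 0.0, 0.4],
}
DIM = 4


def embed(record):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [TOY_VECTORS[t] for t in record.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lsh_candidates(vectors, n_tables=4, n_bits=3, seed=0):
    """Random-hyperplane LSH: records sharing a bucket in any table are paired."""
    rng = random.Random(seed)
    pairs = set()
    for _ in range(n_tables):
        planes = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(n_bits)]
        buckets = defaultdict(list)
        for idx, vec in enumerate(vectors):
            # One bit per hyperplane: which side of the plane the vector lies on.
            sig = tuple(sum(p * x for p, x in zip(plane, vec)) >= 0
                        for plane in planes)
            buckets[sig].append(idx)
        for members in buckets.values():
            pairs.update(combinations(members, 2))
    return pairs


def prune(vectors, pairs):
    """Weight each candidate edge by cosine similarity; keep edges at or above the mean."""
    weights = {(i, j): cosine(vectors[i], vectors[j]) for i, j in pairs}
    if not weights:
        return {}
    threshold = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= threshold}


if __name__ == "__main__":
    records = ["Apple iPhone 12", "apple iphone 12",
               "banana fruit", "Samsung Galaxy phone"]
    vectors = [embed(r) for r in records]
    for (i, j), w in sorted(prune(vectors, lsh_candidates(vectors)).items()):
        print(f"{records[i]!r} <-> {records[j]!r}  weight={w:.3f}")
```

Only the pairs that survive pruning would be passed to the expensive matching step. The number of hash tables and bits per signature trades recall against the number of candidate pairs; the values used here are arbitrary.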
