Crowdsourced Collective Entity Resolution with Relational Match Propagation

Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs that refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. These studies typically resolve entities with pairwise similarity measures over attribute values and resort to the crowd to label the uncertain pairs. However, existing methods still suffer from high labor costs and insufficient labeling. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label selected entity pairs and propagates the labeling information to their neighbors at a distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection, and error-tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.
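The iterate-label-propagate loop described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the graph layout, the function names `propagate` and `resolve`, the `threshold` and `max_hops` parameters, the greedy gain heuristic, and the noiseless simulated crowd are all assumptions made for illustration. Candidate pruning and error-tolerant truth inference are omitted.

```python
from collections import deque


def propagate(graph, source, threshold, max_hops):
    """Return {pair: probability} for pairs reachable from `source` whose
    path probability (product of edge probabilities) stays >= threshold.
    Both `threshold` and `max_hops` are illustrative assumptions."""
    best = {source: 1.0}
    queue = deque([(source, 1.0, 0)])
    while queue:
        node, prob, hops = queue.popleft()
        if hops == max_hops:
            continue
        for neighbor, edge_prob in graph.get(node, {}).items():
            p = prob * edge_prob
            if p >= threshold and p > best.get(neighbor, 0.0):
                best[neighbor] = p
                queue.append((neighbor, p, hops + 1))
    return best


def resolve(graph, candidates, ask_crowd, budget, threshold=0.8, max_hops=2):
    """Iteratively ask about one pair, then propagate a 'match' answer."""
    matches, non_matches = set(), set()
    unresolved = set(candidates)
    asked = 0
    while unresolved and asked < budget:
        # Hypothetical greedy selection: ask about the pair whose 'match'
        # label would resolve the most still-unresolved pairs.
        def gain(pair):
            return len(set(propagate(graph, pair, threshold, max_hops)) & unresolved)

        pair = max(sorted(unresolved), key=gain)
        asked += 1
        if ask_crowd(pair):
            inferred = set(propagate(graph, pair, threshold, max_hops)) & unresolved
            matches |= inferred
            unresolved -= inferred
        else:
            non_matches.add(pair)
            unresolved.discard(pair)
    return matches, non_matches, asked


# Toy data (made up): nodes are candidate entity pairs, edge weights are
# assumed probabilities that a match on one pair implies a match on the other.
graph = {
    ("a1", "b1"): {("a2", "b2"): 0.9, ("a3", "b3"): 0.5},
    ("a2", "b2"): {("a4", "b4"): 0.95},
}
candidates = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
truth = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

# The simulated crowd is a noiseless oracle, which real workers are not.
matches, non_matches, asked = resolve(
    graph, candidates, ask_crowd=lambda pair: pair in truth, budget=10
)
print(matches, non_matches, asked)
```

In this toy run, labeling the first pair as a match also resolves two neighboring pairs, so four candidates are settled with two questions. That saving is the intuition behind inferring matches jointly rather than independently.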
