Entity Resolution with Crowd Errors

Given a set of records, an Entity Resolution (ER) algorithm finds the records that refer to the same real-world entity. Humans can often determine whether two records refer to the same entity, so we study the problem of selecting which questions to ask error-prone humans. We give a Maximum Likelihood formulation for the problem of finding the “most beneficial” questions to ask next. Our theoretical results lead to a lightweight and practical algorithm, bDENSE, for selecting these questions. Our experimental results show that bDENSE reaches an accurate outcome more quickly than two recently proposed approaches. Through this evaluation, we also identify the strengths and weaknesses of all three approaches.
