Select Your Questions Wisely: For Entity Resolution With Crowd Errors

Crowdsourcing is becoming increasingly important for entity resolution tasks, such as clustering of images and natural language processing, owing to their inherent complexity. Humans can provide more insightful information for these difficult problems than machine-based automatic techniques. Nevertheless, human workers can make mistakes due to lack of domain expertise or seriousness, task ambiguity, or even malicious intent. The bulk of the literature deals with human errors via majority voting or by assigning a universal error rate to all crowd workers. However, such approaches are incomplete, and often inconsistent, because the expertise of crowd workers is diverse and possibly biased, making it largely inappropriate to assume a universal error rate for all workers over all crowdsourcing tasks. We mitigate these challenges by considering an uncertain graph model, in which the probability of the edge between two records A and B denotes the fraction of crowd workers who voted YES on the question of whether A and B refer to the same entity. To reflect independence across different crowdsourcing tasks, we apply the notion of possible worlds, and we develop parameter-free algorithms for both the next-crowdsourcing and the entity resolution tasks. In particular, for next crowdsourcing, we identify the record pair that maximally increases the reliability of the current clustering. Since reliability accounts for the connectedness within and across all clusters, this metric is more effective in deciding the next questions than state-of-the-art approaches, which consider only local features, such as individual edges, paths, or nodes, to select the next crowdsourcing questions. Based on a detailed empirical analysis over real-world datasets, we find that our proposed solution, PERC (probabilistic entity resolution with imperfect crowd), improves quality by 15% and reduces overall cost by 50% for crowdsourcing-based entity resolution.
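
To make the model concrete, here is a minimal Python sketch of the uncertain graph and the reliability estimate; it is our illustration, not the authors' implementation, and every name in it (edge_probability, sample_possible_world, estimate_reliability) is hypothetical. It assumes edges carry the YES-vote fraction, possible worlds are sampled by keeping each edge independently with its probability, and a clustering's reliability is the probability that every cluster is internally connected while no sampled edge crosses clusters.

```python
import random

def edge_probability(yes_votes, total_votes):
    """Edge probability for a record pair: the fraction of crowd workers
    who answered YES to 'do these two records refer to the same entity?'"""
    return yes_votes / total_votes if total_votes else 0.0

def sample_possible_world(edges, rng):
    """Sample one possible world: each edge (u, v, p) is materialized
    independently with probability p."""
    return {(u, v) for (u, v, p) in edges if rng.random() < p}

def _connected(nodes, world):
    """BFS check that `nodes` form one connected component using only
    the sampled edges with both endpoints inside `nodes`."""
    nodes = set(nodes)
    adj = {}
    for (u, v) in world:
        if u in nodes and v in nodes:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adj.get(stack.pop(), []):
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes

def is_consistent(world, clusters):
    """A sampled world is consistent with a clustering (a partition of
    the records) if no sampled edge crosses clusters and every cluster
    is internally connected."""
    cluster_of = {n: i for i, c in enumerate(clusters) for n in c}
    if any(cluster_of[u] != cluster_of[v] for (u, v) in world):
        return False
    return all(_connected(c, world) for c in clusters)

def estimate_reliability(edges, clusters, samples=2000, seed=7):
    """Monte Carlo estimate of reliability: the fraction of sampled
    possible worlds consistent with the clustering."""
    rng = random.Random(seed)
    hits = sum(is_consistent(sample_possible_world(edges, rng), clusters)
               for _ in range(samples))
    return hits / samples
```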

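Building on that estimator, the next-crowdsourcing step can be sketched as a greedy selection. This is one plausible, simplified reading of the reliability criterion, not the paper's exact objective: assume each candidate pair would be answered in agreement with the current clustering (YES within a cluster, NO across clusters), resolve the corresponding edge to certainty, and ask the pair whose resolution most increases the estimated reliability.

```python
def next_question(edges, clusters, samples=2000):
    """Hypothetical greedy next-question selection: for each uncertain
    pair, resolve its edge as the current clustering predicts (p=1.0
    within a cluster, p=0.0 across clusters) and pick the pair whose
    resolution yields the largest estimated reliability gain."""
    cluster_of = {n: i for i, c in enumerate(clusters) for n in c}
    base = estimate_reliability(edges, clusters, samples)
    best_pair, best_gain = None, 0.0
    for i, (u, v, p) in enumerate(edges):
        if p in (0.0, 1.0):          # already certain; nothing to ask
            continue
        same = cluster_of[u] == cluster_of[v]
        resolved = edges[:i] + [(u, v, 1.0 if same else 0.0)] + edges[i + 1:]
        gain = estimate_reliability(resolved, clusters, samples) - base
        if gain > best_gain:
            best_pair, best_gain = (u, v), gain
    return best_pair, best_gain
```

On a toy instance the selection behaves as the abstract describes, favoring the question that most firms up the clustering as a whole rather than any fixed local feature:

```python
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.3)]
clusters = [{"a", "b", "c"}, {"d"}]
print(estimate_reliability(edges, clusters))  # ~0.50: both intra-cluster edges
                                              # must survive (0.9 * 0.8) and the
                                              # cross edge must not (0.7)
print(next_question(edges, clusters))         # picks ("c", "d"): confirming the
                                              # cross pair as NO helps most
```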