DOCS: a domain-aware crowdsourcing system using knowledge bases

Crowdsourcing is a computing paradigm that harnesses human effort to solve problems that remain hard for computers, such as entity resolution and photo tagging. Crowd workers vary widely in quality, so it is important to model each worker's quality effectively. Most existing worker models assume that a worker has the same quality across all tasks. In practice, however, tasks span diverse domains, and a worker's quality differs from domain to domain. For example, a basketball fan should label a photo related to 'Stephen Curry' more accurately than one related to 'Leonardo DiCaprio'. In this paper, we study how to leverage domain knowledge to accurately model a worker's quality. We use a knowledge base (KB), e.g., Wikipedia or Freebase, to detect the domains of tasks and workers. We develop Domain Vector Estimation, which analyzes the domains of a task with respect to the KB. We also study Truth Inference, which uses the domain-sensitive worker model to accurately infer the true answer of a task. We design an Online Task Assignment algorithm, which judiciously and efficiently assigns tasks to appropriate workers. To implement these solutions, we have built DOCS, a system deployed on Amazon Mechanical Turk. Experiments show that DOCS performs much better than state-of-the-art approaches.
