Cost-Effective Data Annotation using Game-Based Crowdsourcing

Large-scale data annotation is indispensable for many applications, such as machine learning and data integration. However, existing annotation solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective annotation approach and focuses on the labeling-rule generation problem, which aims to generate high-quality rules that greatly reduce labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering both coverage and precision. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (is a rule valid?) and plays the role of rule generator, while the other group answers tuple checking tasks (is the annotated label of a data tuple correct?) and plays the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage and high precision, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject the rules covering them. This paper studies the challenges in CrowdGame. The first is balancing the trade-off between coverage and precision; we define the loss of a rule by combining the two factors. The second is rule precision estimation; we utilize Bayesian estimation to combine evidence from both rule validation and tuple checking tasks. The third is selecting crowdsourcing tasks within the game-based framework so as to minimize the loss; we introduce a minimax strategy and develop efficient task selection algorithms. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.

PVLDB Reference Format: Jingru Yang, Ju Fan, Zhewei Wei, Guoliang Li, Tongyu Liu, and Xiaoyong Du. Cost-Effective Data Annotation using Game-Based Crowdsourcing. PVLDB, 12(1): 57-70, 2018. DOI: https://doi.org/10.14778/3275536.3275541

* Ju Fan is the corresponding author.
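The abstract describes three technical ingredients: a rule loss that trades off coverage against precision, Bayesian estimation of rule precision from both task types, and minimax task selection. The minimal Python sketch below illustrates only the first two; the loss form, the Beta prior, and all parameter names (alpha, a0, b0) are illustrative assumptions rather than the paper's exact formulation.

    # Sketch of a rule loss combining coverage and precision, with precision
    # estimated as the mean of a Beta posterior that pools crowd answers from
    # rule-validation tasks and tuple-checking tasks. Illustrative only.

    def beta_posterior_precision(valid_votes, invalid_votes,
                                 correct_tuples, wrong_tuples,
                                 a0=1.0, b0=1.0):
        """Posterior mean of rule precision under a Beta(a0, b0) prior,
        treating each crowd answer as Bernoulli evidence (an assumption)."""
        a = a0 + valid_votes + correct_tuples
        b = b0 + invalid_votes + wrong_tuples
        return a / (a + b)

    def rule_loss(covered, total, precision, alpha=0.5):
        """Illustrative loss: alpha weights the coverage term against the
        precision term; lower loss means a more useful labeling rule."""
        coverage = covered / total if total else 0.0
        return alpha * (1.0 - coverage) + (1.0 - alpha) * (1.0 - precision)

    # Example: a rule covering 300 of 1,000 tuples, with 8 "valid" vs. 2
    # "invalid" crowd votes and 27 correctly vs. 3 wrongly labeled tuples.
    p = beta_posterior_precision(8, 2, 27, 3)
    print(round(p, 3), round(rule_loss(300, 1000, p), 3))

Under a minimax strategy of the kind the abstract mentions, the rule generator would favor rule-validation tasks whose rules minimize such a loss, while the rule refuter would favor tuple-checking tasks expected to maximize it; the efficient task selection algorithms themselves are described in the paper.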
