A Crowdsourcing Worker Quality Evaluation Algorithm on MapReduce for Big Data Applications

Crowdsourcing is an emerging distributed computing and business model that has developed alongside the growth of the Internet. As crowdsourcing systems evolve, the volume of data on crowdsourcers, contractors, and tasks grows rapidly, and worker quality evaluation based on big data analysis has become a critical challenge. This paper first proposes a general worker quality evaluation algorithm that can be applied to critical tasks such as tagging, matching, filtering, and categorization, as well as many other emerging applications, without wasting resources. Second, we implement the algorithm on the Hadoop platform using the MapReduce parallel programming model. Finally, we conduct a series of experiments to verify the accuracy and effectiveness of the algorithm across a wide variety of big data scenarios. The experimental results demonstrate that the proposed algorithm is accurate and effective, offers high computing performance and horizontal scalability, and is well suited to large-scale worker quality evaluation in a big data environment.
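
As a concrete illustration of the MapReduce structure described above, the following minimal Java sketch scores each worker by the fraction of gold-standard questions answered correctly: the mapper emits a (workerId, correct-flag) pair per answer, and the reducer aggregates these flags into one quality score per worker. The class names (WorkerQualityJob, QualityMapper, QualityReducer) and the four-field CSV input format (taskId,workerId,answer,goldAnswer) are assumptions made for illustration; the paper's actual evaluation algorithm is not specified in the abstract, and this sketch is not the authors' implementation.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch only: scores workers by accuracy on gold-standard
// answers. It shows the MapReduce shape of a worker quality evaluation,
// not the algorithm proposed in the paper.
public class WorkerQualityJob {

    // Mapper: each input line is assumed to be "taskId,workerId,answer,goldAnswer".
    // Emits (workerId, 1) when the worker's answer matches the gold answer, else (workerId, 0).
    public static class QualityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text workerId = new Text();
        private final IntWritable correct = new IntWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length != 4) {
                return; // skip malformed records
            }
            workerId.set(fields[1].trim());
            correct.set(fields[2].trim().equals(fields[3].trim()) ? 1 : 0);
            context.write(workerId, correct);
        }
    }

    // Reducer: aggregates the per-answer correctness flags of one worker
    // into a single quality score (fraction of correct answers).
    public static class QualityReducer
            extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        private final DoubleWritable quality = new DoubleWritable();

        @Override
        protected void reduce(Text workerId, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            long correct = 0;
            for (IntWritable v : values) {
                total++;
                correct += v.get();
            }
            if (total > 0) {
                quality.set((double) correct / total);
                context.write(workerId, quality);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "worker quality evaluation");
        job.setJarByClass(WorkerQualityJob.class);
        job.setMapperClass(QualityMapper.class);
        job.setReducerClass(QualityReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Under these assumptions, the job would be launched with, for example, hadoop jar worker-quality.jar WorkerQualityJob /answers /quality, and each output line pairs a worker ID with a score in [0, 1]. Because the scoring decomposes into independent per-worker aggregations, such a job scales horizontally by adding nodes, which matches the scalability claim made in the abstract.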
