Parallel and Streaming Truth Discovery in Large-Scale Quantitative Crowdsourcing

To enable reliable crowdsourcing applications, it is important to develop algorithms that automatically discover the truths from possibly noisy and conflicting claims provided by various information sources. To handle crowdsourcing applications involving big or streaming data, a truth discovery algorithm should be not only effective but also scalable. However, for quantitative crowdsourcing applications such as object counting and percentage annotation, existing truth discovery algorithms are not simultaneously effective and scalable: they either address truth discovery in categorical crowdsourcing or perform batch processing that does not scale. In this paper, we propose new parallel and streaming truth discovery algorithms for quantitative crowdsourcing applications. Through extensive experiments on real-world and synthetic datasets, we demonstrate that 1) both algorithms are effective, 2) the parallel algorithm efficiently performs truth discovery on large datasets, and 3) the streaming algorithm processes data incrementally and efficiently performs truth discovery both on large datasets and in data streams.
