Leveraging Clickstream Trajectories to Reveal Low-Quality Workers in Crowdsourced Forecasting Platforms

Crowdwork often entails tackling cognitively demanding and time-consuming tasks. Crowdsourcing is used for complex annotation tasks, from medical imaging to geospatial data, and such data powers sensitive applications such as health diagnostics and autonomous driving. However, the existence and prevalence of underperforming crowdworkers are well recognized and can threaten the validity of crowdsourcing. In this study, we propose a computational framework that identifies clusters of underperforming workers from their clickstream trajectories, focusing on crowdsourced geopolitical forecasting. The framework reveals different types of underperformers, such as workers whose forecast accuracy is far from the consensus of the crowd, workers who provide low-quality explanations for their forecasts, and workers who simply copy-paste their forecasts from other users. Our study suggests that clickstream clustering and analysis are fundamental tools for diagnosing the performance of crowdworkers on platforms that leverage the wisdom of crowds.
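To make the idea of clustering workers by clickstream trajectories concrete, the sketch below is a minimal, hypothetical pipeline: it encodes each worker's clickstream as counts of consecutive action pairs (bigrams) and then groups workers by hierarchical clustering over cosine distances. The action labels, feature choice, and clustering settings are illustrative assumptions, not the framework described in the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation): each worker's
# clickstream is a sequence of action labels; we represent workers by bigram
# counts over those labels, then cluster them by cosine distance.
from collections import Counter

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def bigram_profile(clickstream):
    """Count consecutive action pairs (bigrams) in one worker's clickstream."""
    return Counter(zip(clickstream, clickstream[1:]))


def profile_matrix(clickstreams):
    """Stack all workers' bigram counts into a dense matrix over a shared vocabulary."""
    vocab = sorted({bg for cs in clickstreams for bg in bigram_profile(cs)})
    index = {bg: i for i, bg in enumerate(vocab)}
    X = np.zeros((len(clickstreams), len(vocab)))
    for row, cs in enumerate(clickstreams):
        for bg, count in bigram_profile(cs).items():
            X[row, index[bg]] = count
    return X


# Hypothetical clickstreams: one list of action labels per worker.
workers = [
    ["login", "view_question", "read_comments", "submit_forecast", "write_rationale"],
    ["login", "view_question", "read_comments", "submit_forecast", "write_rationale"],
    ["login", "view_question", "copy_forecast", "submit_forecast"],
    ["login", "copy_forecast", "submit_forecast", "logout"],
]

X = profile_matrix(workers)
# Agglomerative clustering on cosine distances between behavioral profiles.
distances = pdist(X, metric="cosine")
tree = linkage(distances, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # e.g., separates the engaged workers from the copy-paste-like ones
```

In practice, the clusters produced by such a pipeline would still need to be inspected and linked to outcome measures (forecast accuracy, rationale quality) before labeling any group as underperforming.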
