How do annotators label short texts? Toward understanding the temporal dynamics of tweet labeling

Abstract Crowdsourcing is a popular means to obtain human-crafted information, for example tweet labels, which can then be used in text mining tasks. Many studies investigate the quality of the labels submitted by human annotators, but there is less work on understanding how annotators label. It is natural to expect that annotators learn how to annotate, and do so gradually: they do not know in advance which of the tweets they will see are positive and which are negative, but rather figure out over time what constitutes positive and negative sentiment in a tweet. In this paper, we investigate this gradual process and its temporal dynamics. We show that annotators undergo two phases: a learning phase, during which they build a conceptual model of the characteristics that determine the sentiment of a tweet, and an exploitation phase, during which they apply this conceptual model, although learning and refinement of the model continue. As a case study, we investigate a hierarchical tweet labeling task that first distinguishes between relevant and irrelevant tweets, then classifies the relevant ones into factual and non-factual, and finally splits the non-factual ones into positive and negative. As an indicator of learning we use the annotation time, i.e. the time elapsed while an annotator inspects a tweet before assigning the labels across the hierarchy. We show that this annotation time drops as an annotator proceeds through the set of tweets she has to process. We report our results on identifying the learning phase and the subsequent exploitation phase, and on the differences in annotator behavior during each phase.
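
The indicator described above is the per-tweet annotation time, which drops as an annotator moves from learning to exploitation. Below is a minimal sketch of how such a sequence of times could be split into a learning prefix and an exploitation suffix; the function name split_learning_exploitation and the single change-point criterion (minimizing within-segment variance) are illustrative assumptions, not the procedure used in the paper.

import numpy as np

def split_learning_exploitation(annotation_times, min_phase_len=5):
    # Split a sequence of per-tweet annotation times (in the order the
    # annotator processed the tweets) into a "learning" prefix and an
    # "exploitation" suffix by choosing the single change point that
    # minimizes the total within-segment variance of the times.
    t = np.asarray(annotation_times, dtype=float)
    n = len(t)
    best_k, best_cost = None, np.inf
    for k in range(min_phase_len, n - min_phase_len + 1):
        left, right = t[:k], t[k:]
        # within-segment sum of squared deviations for both segments
        cost = left.var() * len(left) + right.var() * len(right)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k  # number of tweets attributed to the learning phase

# Invented example: times shrink and stabilize once the annotator has
# formed a working model of what makes a tweet positive or negative.
times = [42, 38, 35, 30, 28, 22, 15, 14, 16, 13, 15, 14, 12, 13, 14]
k = split_learning_exploitation(times)
print(f"learning phase: first {k} tweets "
      f"(mean {np.mean(times[:k]):.1f}s vs {np.mean(times[k:]):.1f}s after)")

In practice such a split would be estimated per annotator and annotator behavior compared across the two segments; the numbers above are invented for illustration.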
