How Does Tweet Difficulty Affect Labeling Performance of Annotators?

Crowdsourcing is a popular means of obtaining labeled data at moderate cost, for example for tweets, which can then be used in text mining tasks. To alleviate the problem of low-quality labels in this context, multiple human factors have been analyzed to identify and deal with workers who provide such labels. However, one aspect that has rarely been considered is the inherent difficulty of the tweets to be labeled and how it affects the reliability of the labels that annotators assign to them. In this preliminary study, we therefore investigate this connection using a hierarchical sentiment labeling task on Twitter. We find that there is indeed a relationship between the two factors, provided that annotators have labeled some tweets before: labels assigned to easy tweets are more reliable than those assigned to difficult tweets. Consequently, training predictors on easy tweets improves performance by up to 6% in our experiment. This implies potential improvements for active learning techniques and crowdsourcing.
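
The claim that training only on easy tweets can improve a downstream predictor can be illustrated with a minimal sketch. The toy data, the "is_easy" flag (e.g. derived from full annotator agreement), and the column layout below are illustrative assumptions, not the paper's actual setup or data; the comparison logic is the point.

```python
# Hypothetical sketch: compare a sentiment classifier trained on all labeled
# tweets versus only "easy" tweets. Data and the easiness flag are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Toy labeled tweets: (text, sentiment, is_easy). Here "is_easy" stands in for
# a difficulty estimate, e.g. whether all annotators agreed on the label.
train = [
    ("I love this phone", "pos", True),
    ("worst service ever", "neg", True),
    ("great game last night", "pos", True),
    ("totally disappointed again", "neg", True),
    ("well that was something", "neg", False),   # ambiguous -> difficult
    ("sure, great job...", "neg", False),        # sarcastic -> difficult
]
test = [("this is awesome", "pos"), ("I hate waiting", "neg")]

def train_and_eval(rows):
    """Train a TF-IDF + logistic regression model and report test accuracy."""
    texts, labels = zip(*[(text, label) for text, label, *_ in rows])
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, labels)
    preds = model.predict([text for text, _ in test])
    return accuracy_score([label for _, label in test], preds)

print("trained on all tweets :", train_and_eval(train))
print("trained on easy only  :", train_and_eval([r for r in train if r[2]]))
```

In a real experiment the difficulty signal would come from the crowdsourced labels themselves (e.g. inter-annotator agreement per tweet) and the evaluation would use a held-out set of expert labels; the snippet only shows how the two training regimes would be compared.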
