Let's Agree to Disagree: Measuring Agreement between Annotators for Opinion Mining Task

There is a need to know to what degree humans can agree when classifying a sentence as carrying a particular sentiment orientation. However, little research has been done on assessing the agreement between annotators for the different opinion mining tasks. In this work we present an assessment of agreement between two human annotators. The task was to manually classify newspaper sentences into one of three classes: positive, negative, or neutral. To assess the level of agreement, Cohen's kappa coefficient was computed. Results show that annotators agree more on the negative class than on the positive or neutral classes. We observed agreement of up to 0.65 (substantial agreement) in the best case and 0.30 in the worst.
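The agreement measure reported here is Cohen's kappa. As a rough illustration of how such a score is computed for two annotators and three sentiment classes, the sketch below implements the standard formula kappa = (p_o - p_e) / (1 - p_e) on hypothetical labels; the example data and label names are assumptions for illustration, not the paper's corpus.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance, estimated from each
    annotator's marginal label distribution.
    """
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)

    # Observed agreement: fraction of items both annotators labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over classes of the product of the two marginals.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())

    return (p_o - p_e) / (1 - p_e)

# Hypothetical annotations for ten sentences (positive / negative / neutral).
ann_1 = ["neg", "neg", "pos", "neu", "neg", "pos", "neu", "neg", "pos", "neu"]
ann_2 = ["neg", "neg", "pos", "pos", "neg", "pos", "neu", "neg", "neu", "neu"]
print(f"kappa = {cohen_kappa(ann_1, ann_2):.2f}")  # about 0.70 for this toy data
```

Under the Landis and Koch scale, values between 0.61 and 0.80 are conventionally read as substantial agreement, which is how the paper's best-case value of 0.65 is characterized.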
