A study of inter-annotator agreement for opinion retrieval

Evaluation of sentiment analysis, like large-scale IR evaluation, relies on the accuracy of human assessors to create judgments. Subjectivity in judgments is a problem for relevance assessment and even more so in the case of sentiment annotations. In this study we examine the degree to which assessors agree upon sentence-level sentiment annotation. We show that inter-assessor agreement is not contingent on document length or frequency of sentiment but correlates positively with automated opinion retrieval performance. We also examine the individual annotation categories to determine which categories pose most difficulty for annotators.