The Unreliability of Measures of Intercoder Reliability, and What to Do About It

In both automated and traditional text analysis, human coders are regularly tasked with categorizing documents. Researchers then evaluate the success of this crucial step in the research process via one of many measures of intercoder reliability, such as Cronbach's alpha. They then improve coding practices until this measure reaches some arbitrary threshold, at which point remaining disagreements are resolved in arbitrary ways and ignored in subsequent analyses. We show that this common practice can generate severely biased estimates and misleading conclusions. The problem is the focus on measures of intercoder reliability which, except at the extreme, are unrelated to the quantities of interest, such as the proportion of documents in each category. We thus develop an approach that enables scholars to directly incorporate coding uncertainty into statistical estimation. The method offers an interval estimate which we prove contains the true proportion of documents in each category, under reasonable assumptions. We then extend this method to situations with multiple coders, when one coder is trusted more than another, and when the resulting document codes are used as inputs to another statistical model. We offer easy-to-use software that implements all our suggestions.

∗Department of Political Science, Stanford University, JustinGrimmer.org, jgrimmer@stanford.edu.
†Institute for Quantitative Social Science, Harvard University; GaryKing.org, king@harvard.edu, (617) 500-7570.
‡Institute for Quantitative Social Science, Harvard University, scholar.harvard.edu/csuperti, csuperti@fas.harvard.edu
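The paper's estimator is developed in the body of the text, but the intuition behind interval estimates under coding disagreement can be conveyed with a minimal sketch. Assume, purely for illustration and more strongly than the paper requires, that every document's true code is one of the codes its two coders assigned. Then the true proportion of documents in a category is bounded below by the share both coders placed there and above by the share at least one did. The function name and example data below are hypothetical, not taken from the paper or its software.

```python
import numpy as np

def proportion_bounds(codes_a, codes_b, category):
    """Bound the proportion of documents in `category` from two coders' labels.

    Illustrative assumption only: each document's true code lies in the set
    {coder A's code, coder B's code}. This is NOT the paper's estimator,
    just a sketch of interval bounds under coding disagreement.
    """
    codes_a = np.asarray(codes_a)
    codes_b = np.asarray(codes_b)
    # Lower bound: documents both coders placed in the category.
    lower = np.mean((codes_a == category) & (codes_b == category))
    # Upper bound: documents at least one coder placed in the category.
    upper = np.mean((codes_a == category) | (codes_b == category))
    return lower, upper

# Hypothetical example: two coders label ten documents into categories 0/1.
a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]
lo, hi = proportion_bounds(a, b, category=1)
print(f"True proportion in category 1 lies in [{lo:.2f}, {hi:.2f}]")  # [0.50, 0.70]
```

The width of the interval grows with coder disagreement, which is the sense in which coding uncertainty propagates into the quantity of interest rather than being discarded once a reliability threshold is met.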
