Learning Curves for Automating Content Analysis: How Much Human Annotation is Needed?

In this paper, we explore the potential for reducing human effort when coding text segments for use in content analysis. The key idea is to do some coding by hand, to use the results of that initial effort as training data, and then to code the remainder of the content automatically. The test collection includes 102 written prepared statements about Net neutrality from public hearings held by the U.S. Congress and the U.S. Federal Communications Commission (FCC). Six categories were used in this analysis: wealth, social order, justice, freedom, innovation, and honor. A support vector machine (SVM) classifier and a Naïve Bayes (NB) classifier were trained on manually annotated sentences from between one and 51 documents and tested on a held-out set of 51 documents. The results show that the inflection point for a standard measure of classifier accuracy (F1) occurs early: with only 30 training documents, the SVM classifier reaches at least 85% of its best achievable result, and the NB classifier reaches at least 88% of its best achievable result. With the exception of honor, the results suggest that machine classification could reasonably be scaled up to larger collections of similar documents without additional human annotation effort.
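
To make the experimental design concrete, the following is a minimal sketch of the learning-curve procedure: train on the sentences of the first k manually annotated documents, score sentence-level F1 on the fixed held-out set, and repeat as k grows from 1 to 51. It assumes scikit-learn and a hypothetical data layout (each document as a list of (sentence, label) pairs for one value category); the paper's own toolchain, features, and preprocessing are not reproduced here.

# Hypothetical sketch of the learning-curve experiment described above,
# written with scikit-learn rather than the authors' actual toolchain,
# which the abstract does not name. Each document is assumed to be a list
# of (sentence_text, label) pairs, where label is 1 if the sentence
# expresses a given value category (e.g., freedom) and 0 otherwise.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

def learning_curve(train_docs, test_docs, max_train=51):
    # Flatten the fixed held-out test set once.
    test_texts = [s for doc in test_docs for s, _ in doc]
    test_labels = [y for doc in test_docs for _, y in doc]
    scores = {"SVM": [], "NB": []}
    for k in range(1, max_train + 1):
        # Sentences from the first k training documents only.
        texts = [s for doc in train_docs[:k] for s, _ in doc]
        labels = [y for doc in train_docs[:k] for _, y in doc]
        if len(set(labels)) < 2:
            # Too few documents to observe both classes; nothing to fit.
            for name in scores:
                scores[name].append(float("nan"))
            continue
        # Simple bag-of-words features; the paper's exact preprocessing
        # (e.g., stemming) is not reproduced here.
        vec = TfidfVectorizer()
        X_train = vec.fit_transform(texts)
        X_test = vec.transform(test_texts)
        for name, clf in (("SVM", LinearSVC()), ("NB", MultinomialNB())):
            clf.fit(X_train, labels)
            scores[name].append(f1_score(test_labels, clf.predict(X_test)))
    return scores

Plotting each score list against k would reproduce the shape of the learning curve the abstract describes, with the inflection point visible where the F1 gains flatten out.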
