Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning

Supervised machine learning is a promising methodological innovation for content analysis (CA) that addresses the challenge of ever-growing amounts of text in the digital era. Social scientists have identified the accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification allows, for example, for time-series analysis of diachronic data sets or for correlating categories with text-external covariates. We evaluate the performance of two common approaches to this goal: a method based on regression analysis with feature profiles from entire collections, and a method that aggregates classifier decisions for individual documents. For both, we observed a significant negative effect on classification performance caused by the uneven distribution of characteristic language structures within the text collection, which poses considerable problems for proportional classification. To address this problem, we propose an active learning workflow that alternates between machine learning and human coding. Results from experiments with empirical data (political manifestos) demonstrate that active learning enables researchers to create training sets for automatic CA efficiently, reliably, and with high accuracy for the desired goal while retaining control over the automatic process.
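
The alternation between machine learning and human coding can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: it assumes each document is reduced to a single numeric score, it uses a toy threshold classifier, and it uses uncertainty sampling as the query strategy. All names (`fit_threshold`, `estimate_proportion`, `oracle`) are hypothetical. The category proportion is estimated by combining human codes for the queried documents with classifier decisions for the remaining ones.

```python
def fit_threshold(labeled):
    """Toy classifier: threshold halfway between the two class means.

    `labeled` is a list of (score, label) pairs with labels 0 or 1;
    both classes must be present.
    """
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Assign category 1 to documents scoring at or above the threshold."""
    return 1 if x >= threshold else 0


def estimate_proportion(pool, oracle, seed_items, rounds=5, batch=2):
    """Estimate the share of category 1 in `pool` via active learning.

    `oracle` stands in for the human coder (an assumption of this sketch).
    Scores in `pool` are assumed to be unique.
    """
    labeled = [(x, oracle(x)) for x in seed_items]
    unlabeled = [x for x in pool if x not in seed_items]

    for _ in range(rounds):
        threshold = fit_threshold(labeled)           # machine learning step
        # Uncertainty sampling: query documents closest to the boundary.
        unlabeled.sort(key=lambda x: abs(x - threshold))
        queries, unlabeled = unlabeled[:batch], unlabeled[batch:]
        labeled += [(x, oracle(x)) for x in queries]  # human coding step

    threshold = fit_threshold(labeled)
    # Human codes for labeled documents, classifier decisions for the rest.
    positives = sum(y for _, y in labeled)
    positives += sum(predict(threshold, x) for x in unlabeled)
    return positives / len(pool)


if __name__ == "__main__":
    pool = list(range(20))                     # hypothetical document scores
    oracle = lambda x: 1 if x >= 14 else 0     # simulated human coder
    print(estimate_proportion(pool, oracle, seed_items=[0, 19]))
```

Setting `rounds=0` reduces the sketch to plain aggregation of individual classifier decisions trained on the seed documents only, which gives a baseline for comparison with the active learning estimate.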
