Paper Information - Proportional Classification Revisited: Automatic Content Analysis of Political Manifestos Using Active Learning

Supervised machine learning is a promising methodological innovation for content analysis (CA), one that helps address the challenge of ever-growing amounts of text in the digital era. Social scientists have named the accurate measurement of category proportions and trends in large collections as their primary goal. Proportional classification, for example, allows for time-series analysis of diachronic data sets or for correlating categories with text-external covariates. We evaluate the performance of two common approaches to this goal: a method based on regression analysis with feature profiles from entire collections, and a method that aggregates classifier decisions for individual documents. For both, we observed a significant negative effect on classification performance caused by the uneven distribution of characteristic language structures within the text collection, which poses considerable problems for proportional classification. To address this, we propose an active learning workflow that alternates between machine learning and human coding. Results from experiments with empirical data (political manifestos) demonstrate that active learning enables researchers to create training sets for automatic CA efficiently, reliably, and with high accuracy for the desired goal while retaining control over the automatic process.
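The workflow described in the abstract, alternating between machine learning and human coding and then aggregating per-document decisions into a category proportion, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the tiny naive Bayes classifier, the uncertainty-sampling query strategy, the synthetic `make_pool` data, and the `oracle` callback that stands in for a human coder are all hypothetical.

```python
import math
import random
from collections import Counter


def train_nb(docs, labels):
    """Fit a two-class multinomial naive Bayes model with add-one smoothing."""
    vocab = {tok for doc in docs for tok in doc}
    counts = {0: Counter(), 1: Counter()}
    for doc, y in zip(docs, labels):
        counts[y].update(doc)
    return vocab, counts, Counter(labels)


def prob_positive(model, doc):
    """Return P(label = 1 | doc) under the fitted model."""
    vocab, counts, n_docs = model
    log_p = {}
    for y in (0, 1):
        total = sum(counts[y].values()) + len(vocab)
        # Smoothed class prior plus smoothed log-likelihood of known tokens.
        lp = math.log((n_docs[y] + 1) / (sum(n_docs.values()) + 2))
        lp += sum(math.log((counts[y][t] + 1) / total) for t in doc if t in vocab)
        log_p[y] = lp
    m = max(log_p.values())
    e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
    return e1 / (e0 + e1)


def active_learning(pool, oracle, seed_idx, rounds=5, batch=10):
    """Alternate between model fitting and (simulated) human coding."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        idx = sorted(labeled)
        model = train_nb([pool[i] for i in idx], [labeled[i] for i in idx])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        # Assumed query strategy: uncertainty sampling, i.e. documents whose
        # predicted probability is closest to 0.5 go to the coder first.
        unlabeled.sort(key=lambda i: abs(prob_positive(model, pool[i]) - 0.5))
        for i in unlabeled[:batch]:
            labeled[i] = oracle(i)
    idx = sorted(labeled)
    model = train_nb([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


def estimate_proportion(model, pool):
    """Aggregate individual decisions: share of documents classified positive."""
    hits = sum(prob_positive(model, doc) >= 0.5 for doc in pool)
    return hits / len(pool)


def make_pool(n=300, share=0.3, seed=0):
    """Build a synthetic pool of tokenized 'sentences' (illustrative data only)."""
    rng = random.Random(seed)
    pos = ["tax", "welfare", "pension"]
    neg = ["defence", "border", "army"]
    shared = ["we", "will", "the", "policy"]
    pool, truth = [], []
    for _ in range(n):
        y = 1 if rng.random() < share else 0
        topical = pos if y else neg
        # Two topical tokens, one noisy token from either class, five fillers.
        doc = rng.choices(topical, k=2) + rng.choices(pos + neg, k=1)
        pool.append(doc + rng.choices(shared, k=5))
        truth.append(y)
    return pool, truth


if __name__ == "__main__":
    pool, truth = make_pool()
    model, labeled = active_learning(pool, lambda i: truth[i], seed_idx=range(10))
    print("documents coded:", len(labeled))
    print("estimated share:", estimate_proportion(model, pool))
    print("true share:", sum(truth) / len(truth))
```

In a real project the `oracle` would be replaced by actual human coding decisions, and the estimated share would be compared against a held-out, manually coded sample rather than against synthetic ground truth.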

Gregor Wiedemann
