Tradeoffs in Accuracy and Efficiency in Supervised Learning Methods

ABSTRACT

Words are an increasingly important source of data for social science research. Automated classification methodologies hold the promise of substantially lowering the costs of analyzing large amounts of text. In this article, we consider a number of questions of interest to prospective users of supervised learning methods, which automatically classify events according to a pre-existing classification system. Although information scientists devote considerable attention to assessing the performance of different supervised learning algorithms and feature representations, the questions they ask are often less directly relevant to the more practical concerns of social scientists. The first question prospective social science users are likely to ask is: How well do such methods work? The second is: How much human labeling effort is required? The third is: How do we assess whether virgin cases have been automatically classified with sufficient accuracy? We address these questions in the context of a particular dataset, the Congressional Bills Project, which includes more than 400,000 bill titles that humans have classified into 20 policy topics. This corpus offers an unusual opportunity to assess the performance of different algorithms, the impact of sample size, and the benefits of ensemble learning as a means for estimating classification accuracy.
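
As a rough illustration of the kind of workflow the article evaluates, the sketch below trains several off-the-shelf classifiers on labeled bill titles and then uses their agreement as a signal of classification confidence on held-out cases. It is not the authors' implementation: the file name bills.csv, the column names title and topic, and the choice of scikit-learn models with TF-IDF features are all assumptions made for illustration.

    # Illustrative sketch only (assumed data layout, not the authors' pipeline):
    # supervised topic classification of bill titles, plus ensemble agreement
    # as a rough confidence signal for automatically coded cases.
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical CSV with one row per bill: a 'title' column and a 'topic'
    # column holding one of the 20 major policy topic codes.
    df = pd.read_csv("bills.csv")
    X_train, X_test, y_train, y_test = train_test_split(
        df["title"], df["topic"], test_size=0.2, random_state=42
    )

    # Simple bag-of-words / TF-IDF feature representation.
    vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train several different learners on the same features.
    models = {
        "svm": LinearSVC(),
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        acc = accuracy_score(y_test, model.predict(X_test_vec))
        print(f"{name}: held-out accuracy = {acc:.3f}")

    # Ensemble agreement: cases on which all learners assign the same topic
    # tend to be coded more accurately, so agreement can flag which
    # automatically classified cases to trust and which to send to human coders.
    predictions = {name: model.predict(X_test_vec) for name, model in models.items()}
    agree = [len({predictions[m][i] for m in models}) == 1 for i in range(len(y_test))]
    agree_acc = accuracy_score(
        [y for y, a in zip(y_test, agree) if a],
        [predictions["svm"][i] for i, a in enumerate(agree) if a],
    )
    print(f"accuracy on cases where all models agree: {agree_acc:.3f}")

The final agreement check mirrors the abstract's point about ensemble learning: cases on which independently trained classifiers agree can be accepted with greater confidence, while disagreements can be routed back to human coders.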
