Computer-Assisted Topic Classification for Mixed-Methods Social Science Research

ABSTRACT Social scientists doing mixed-methods research have traditionally relied on human annotators to classify the documents or events used in their analyses. The rapid growth of digitized government documents in recent years presents new opportunities for research but also new challenges: as more data comes online, relying on human annotators becomes prohibitively expensive for many tasks. For researchers who want to save time and money while maintaining confidence in their results, we show how a particular supervised learning system can estimate the class of each document (or event). The system maintains high classification accuracy and provides accurate estimates of document proportions, while achieving reliability levels comparable to those of human annotators. We estimate that it lowers the cost of classifying large numbers of complex documents by 80% or more.
