Clause Analysis: Using Syntactic Information to Automatically Extract Source, Subject, and Predicate from Texts with an Application to the 2008–2009 Gaza War

This article presents a new method and open-source R package that uses syntactic information to automatically extract source–subject–predicate clauses. The method improves on frequency-based text analysis by dividing text into predicates, each with an identified subject and an optional source, thereby extracting the statements and actions of (political) actors as mentioned in the text. The content of these predicates can then be analyzed with existing frequency-based methods, which allows actions, issue positions, and framing to be analyzed per actor within a single text. We show that a small set of syntactic patterns can extract clauses and identify quotes with good accuracy, significantly outperforming a baseline system based on word order. Taking the 2008–2009 Gaza War as an example, we further show how corpus comparison and semantic network analysis, applied to the results of the clause analysis, reveal differences in citation and framing patterns between U.S. and English-language Chinese coverage of this war.
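To make the idea of a syntactic pattern concrete, the following is a minimal Python sketch of source–subject–predicate extraction from a dependency parse. It is an illustration under stated assumptions, not the patterns or the API of the R package described here: the `Token` class, the `SAY_VERBS` list, the `extract_clauses` function, and the hand-written parse of the example sentence are all hypothetical, and the relation labels (`nsubj`, `ccomp`, `obj`) are assumed to follow common dependency-annotation conventions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Token:
    idx: int      # 1-based position in the sentence
    word: str
    head: int     # idx of the syntactic head, 0 for the root
    deprel: str   # dependency relation to the head


# Hypothetical, deliberately tiny list of reporting verbs.
SAY_VERBS = {"said", "says", "claimed", "stated"}


def children(tokens: List[Token], head_idx: int) -> List[Token]:
    return [t for t in tokens if t.head == head_idx]


def child_with(tokens: List[Token], head_idx: int, deprel: str) -> Optional[Token]:
    return next((t for t in children(tokens, head_idx) if t.deprel == deprel), None)


def subtree(tokens: List[Token], root_idx: int) -> List[int]:
    """Return the sorted token indices of root_idx and all its descendants."""
    ids = {root_idx}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.head in ids and t.idx not in ids:
                ids.add(t.idx)
                changed = True
    return sorted(ids)


def span_text(tokens: List[Token], ids: List[int]) -> str:
    words = {t.idx: t.word for t in tokens}
    return " ".join(words[i] for i in ids)


def extract_clauses(tokens: List[Token]) -> List[Dict[str, Optional[str]]]:
    # Pattern 1 (assumed): a reporting verb with a subject and a clausal
    # complement marks the subject as the source of the complement clause.
    sources: Dict[int, str] = {}
    reporting = set()
    for verb in tokens:
        if verb.word.lower() not in SAY_VERBS:
            continue
        subj = child_with(tokens, verb.idx, "nsubj")
        comp = child_with(tokens, verb.idx, "ccomp")
        if subj is not None and comp is not None:
            sources[comp.idx] = span_text(tokens, subtree(tokens, subj.idx))
            reporting.add(verb.idx)

    # Pattern 2 (assumed): any other word with a subject heads a clause;
    # the predicate is that word's subtree minus the subject's subtree.
    clauses = []
    for verb in tokens:
        if verb.idx in reporting:
            continue
        subj = child_with(tokens, verb.idx, "nsubj")
        if subj is None:
            continue
        subj_ids = subtree(tokens, subj.idx)
        pred_ids = [i for i in subtree(tokens, verb.idx) if i not in subj_ids]
        clauses.append({
            "source": sources.get(verb.idx),
            "subject": span_text(tokens, subj_ids),
            "predicate": span_text(tokens, pred_ids),
        })
    return clauses


# Hand-written parse of "Israel said Hamas fired rockets" (illustrative only).
demo = [
    Token(1, "Israel", 2, "nsubj"),
    Token(2, "said", 0, "root"),
    Token(3, "Hamas", 4, "nsubj"),
    Token(4, "fired", 2, "ccomp"),
    Token(5, "rockets", 4, "obj"),
]
clauses = extract_clauses(demo)
```

In this toy example the predicate "fired rockets" is attributed to the subject "Hamas" and the statement as a whole to the source "Israel", which is the kind of distinction a word-frequency count over the sentence would miss. A real system would take its parse from a dependency parser and would need more patterns (passives, "according to" constructions, direct quotes) than the two shown here.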
