A Progressive Supervised-learning Approach to Generating Rich Civil Strife Data

“Big data” in the form of unstructured text pose both challenges and opportunities for social scientists committed to advancing research frontiers. Because machine-based and human-centric approaches to content analysis have different strengths for extracting information from unstructured text, the authors argue for a collaborative, hybrid approach that combines their comparative advantages. The notion of a progressive supervised-learning approach that combines data science techniques and human coders is developed and illustrated using the Social, Political and Economic Event Database (SPEED) project’s Societal Stability Protocol. SPEED’s rich event data on civil strife reveal that conventional machine-based approaches for generating event data miss a great deal of within-category variance, while conventional human-based efforts to categorize periods of civil war or political instability routinely misspecify periods of calm and unrest. To demonstrate the potential of hybrid data collection methods, SPEED data on event intensities and origins are used to trace the changing role of political, socioeconomic, and sociocultural factors in generating global civil strife in the post–World War II era.
