Spark-Based Political Event Coding

Political event data have been widely used to study international politics. Previously, processing natural-language text and generating events from it required substantial human effort. Today, high-performance computing infrastructure and advanced NLP tools can relieve much of that tedious work. TABARI, an open-source, non-distributed event-coding program, was an early effort to generate events from a large corpus. It uses a shallow parser to identify political actors, but it ignores semantics and the relations among sentences. PETRARCH, the successor of TABARI, encodes event data in a "who-did-what-to-whom" format. It uses Stanford CoreNLP to parse sentences and a static CAMEO dictionary to encode the data. To build dynamic dictionaries, we need to analyze more metadata from the parsed sentences, such as tokens, Named Entity Recognition (NER) tags, co-reference chains, and more. Although these tools can code modest amounts of source corpora into event data, they are too slow and suffer from scalability issues when we try to extract this metadata from a single document. The situation is worse for other languages such as Spanish or Arabic. In this paper, we develop a novel distributed framework using Apache Spark, MongoDB, Stanford CoreNLP, and PETRARCH. The framework implements a distributed workflow: it uses Stanford CoreNLP to extract all the metadata (parse trees, tokens, lemmas, etc.) from the news corpora of the Gigaword dataset and stores it in MongoDB, and then uses PETRARCH to encode events from that metadata. The framework integrates both tools on distributed commodity hardware and substantially reduces text processing time relative to a non-distributed architecture. We chose Spark over other distributed frameworks such as MapReduce, Storm, and Mahout because Spark offers in-memory computation and lower processing time in both batch and stream processing.
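The data flow described above (annotate documents partition by partition, store the metadata, then code "who-did-what-to-whom" events from it) can be illustrated with a minimal, standard-library-only Python sketch. It is not the framework's implementation: `annotate_partition` is a hypothetical stand-in for Stanford CoreNLP that only splits on whitespace, the `store` dictionary stands in for a MongoDB collection, `code_events` stands in for PETRARCH, and the `ACTORS` and `VERBS` dictionaries and their codes are toy values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Event:
    """One coded event in who-did-what-to-whom form."""
    source: str  # who
    action: str  # did what (toy event code)
    target: str  # to whom


# Toy dictionaries (assumptions for illustration, not the real CAMEO files).
ACTORS = {"France": "FRA", "Germany": "DEU"}
VERBS = {"criticized": "111"}


def annotate_partition(docs: Iterable[Dict]) -> Iterator[Dict]:
    """Hypothetical stand-in for CoreNLP: emit token metadata per document.

    Processing a whole partition at once mirrors the idea of setting up one
    (expensive) annotator per partition rather than one per document.
    """
    for doc in docs:
        tokens = doc["text"].rstrip(".").split()
        yield {"doc_id": doc["doc_id"], "tokens": tokens}


def code_events(record: Dict) -> List[Event]:
    """Hypothetical stand-in for PETRARCH: match actor-verb-actor patterns."""
    tokens = record["tokens"]
    events = []
    for i, tok in enumerate(tokens):
        if tok not in VERBS:
            continue
        # Nearest known actor before the verb is the source ...
        src = next((ACTORS[t] for t in reversed(tokens[:i]) if t in ACTORS), None)
        # ... and the nearest known actor after it is the target.
        tgt = next((ACTORS[t] for t in tokens[i + 1:] if t in ACTORS), None)
        if src and tgt:
            events.append(Event(src, VERBS[tok], tgt))
    return events


def run_pipeline(docs: List[Dict], num_partitions: int = 2) -> Dict[str, List[Event]]:
    """Split docs into partitions, annotate, store metadata, then code events."""
    partitions = [docs[i::num_partitions] for i in range(num_partitions)]
    store: Dict[str, Dict] = {}  # stand-in for a MongoDB collection keyed by doc_id
    for part in partitions:      # a cluster would process these in parallel
        for record in annotate_partition(part):
            store[record["doc_id"]] = record
    return {doc_id: code_events(rec) for doc_id, rec in store.items()}


if __name__ == "__main__":
    sample = [
        {"doc_id": "d1", "text": "France criticized Germany."},
        {"doc_id": "d2", "text": "Rain fell."},
    ]
    print(run_pipeline(sample))
```

In a Spark deployment, a function shaped like `annotate_partition` could be passed to `RDD.mapPartitions` so that each worker annotates its own slice of the corpus; the sequential loop over partitions here only mimics that split on a single machine.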
