Data derived from text is one of the most important new resources for quantitative political science research. "Event data," which codes structured actor-event-target triples from text, is a particularly useful form of such data. Most publicly available event datasets, though, are limited to English, which restricts their usefulness for studying many regions. We demonstrate new techniques for coding events in English, as well as in Arabic, a previously uncoded language. Generating political event data in a given language requires "actor" and "verb" dictionaries specific to that language. Efficiently developing accurate and extensive dictionaries is a difficult challenge. In this paper, we describe four approaches we have used to produce dictionaries and explain how other researchers can use our ideas to develop dictionaries for a new language or a new ontology. This work stems from an ongoing NSF RIDIR project, "Modernizing Political Event Data," which aims to produce multilingual event data and the software researchers need to produce custom datasets.