New Techniques for Coding Political Events across Languages

Data produced from text is one of the most important new resources for quantitative political science research. "Event data," which codes structured actor-event-target triples from text, is a particularly useful form of such data. Most publicly available event datasets, though, are limited to English, which reduces their usefulness for studying many regions. We demonstrate new techniques for coding events in English, as well as in Arabic, a previously uncoded language. Generating political event data in a given language requires "actor" and "verb" dictionaries specific to that language. Efficiently developing accurate and extensive dictionaries is a difficult challenge. In this paper, we describe four approaches we have used to produce dictionaries and explain how other researchers can use our ideas to develop dictionaries for a new language or a new ontology. This work stems from an ongoing NSF RIDIR project, "Modernizing Political Event Data," which aims to produce multilingual event data and the software researchers need to produce custom datasets.