SPERG: Scalable Political Event Report Geoparsing in Big Data

Digital newspaper archives accumulated over the last few decades are an easily accessible, rich source of information for researchers conducting analytic studies. Geoparsing, the process of extracting unambiguous geographic identifiers such as geographic coordinates solely from text descriptions, has proven to be a major challenge when applied to massive corpora like newspaper archives. We focus primarily on archived newspaper reports of political events and aim to parse the exact event location with high accuracy. We first identify all focus locations, that is, every location mentioned in a report together with its coordinates, and then recognize the location where the event actually occurred, which we define as the primary focus location. Our objective is to extract the latitude and longitude of these primary focus locations. Existing geoparsers only partially serve this purpose and are not robust enough to process large data archives in reasonable time. In this paper, we propose a framework that extracts the geolocation of the primary focus location from 76 million documents in a distributed environment.