Creating a Novel Geolocation Corpus from Historical Texts

This paper describes the process of annotating a historical US Civil War corpus with geographic references. Annotations are given at two textual scales: individual place names and whole documents. This is the first published corpus of its kind for document-level geolocation, and it contains over 10,000 disambiguated toponyms, double the number in any prior toponym corpus. We outline many challenges and considerations in creating such a corpus, and we evaluate baseline and benchmark toponym resolution and document geolocation systems on it. Aspects of the corpus suggest several recommendations for proper annotation procedure for both tasks.

[1]  Christopher D. Manning,et al.  Nested Named Entity Recognition , 2009, EMNLP.

[2]  Jochen L. Leidner Toponym resolution in text: annotation, evaluation and applications of spatial grounding , 2007, SIGF.

[3]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[4]  Andrea Esuli,et al.  CoPhIR: a Test Collection for Content-Based Image Retrieval , 2009, ArXiv.

[5]  Jason Baldridge,et al.  Gazetteer-Independent Toponym Resolution Using Geographic Word Profiles , 2015, AAAI.

[6]  Bruno Martins,et al.  Using machine learning methods for disambiguating place references in textual documents , 2014, GeoJournal.

[7]  Max Mühlhäuser,et al.  A Multi-Indicator Approach for Geolocalization of Tweets , 2013, ICWSM.

[8]  Scott Nesbit Visualizing Emancipation: Mapping the End of Slavery in the American Civil War , 2014, Computation for Humanity.

[9]  Jason Baldridge,et al.  Hierarchical Discriminative Classification for Text-Based Geolocation , 2014, EMNLP.

[10]  William G. Thomas,et al.  The Iron Way: Railroads, the Civil War, and the Making of Modern America , 2011 .

[11]  Y. Sternhell The Afterlives of a Confederate Archive: Civil War Documents and the Making of Sectional Reconciliation , 2016 .

[12]  Naoaki Okazaki,et al.  Annotating Geographical Entities on Microblog Text , 2015, LAW@NAACL-HLT.

[13]  Hanan Samet,et al.  Adaptive context features for toponym resolution in streaming news , 2012, SIGIR '12.

[14]  Steven Schockaert,et al.  Georeferencing Wikipedia pages using language models from Flickr , 2011, ISWC 2011.

[15]  Timothy Baldwin,et al.  Text-Based Twitter User Geolocation Prediction , 2014, J. Artif. Intell. Res..

[16]  Xiangji Huang,et al.  Mining query-driven contexts for geographic and temporal search , 2013, Int. J. Geogr. Inf. Sci..

[17]  Jason Baldridge,et al.  Supervised Text-based Geolocation Using Language Models on an Adaptive Grid , 2012, EMNLP.

[18]  Gregory R. Crane,et al.  Disambiguating Geographic Names in a Historical Digital Library , 2001, ECDL.

[19]  Jason Baldridge,et al.  Text-Driven Toponym Resolution using Indirect Supervision , 2013, ACL.

[20]  Benjamin Patai Wing Text-based document geolocation and its application to the digital humanities , 2015 .

[21]  S. Nesbit,et al.  Seeing Emancipation: Scale and Freedom in the American South , 2011 .

[22]  Steven Schockaert,et al.  Georeferencing Wikipedia Documents Using Data from Social Media Sources , 2014, ACM Trans. Inf. Syst..

[23]  K. Baldwin The Visual Documentation of Antietam: Peaceful Settings, Morbid Curiosity, and a Profitable Business , 2011 .

[24]  J. Altham Naming and necessity. , 1981 .

[25]  Vanessa Murdock,et al.  Modeling locations with social media , 2013, Information Retrieval.

[26]  Claire Grover,et al.  Use of the Edinburgh geoparser for georeferencing digitized historical collections , 2010, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.