A deeply annotated testbed for geographical text analysis: The Corpus of Lake District Writing

This paper describes the development of an annotated corpus which forms a challenging testbed for geographical text analysis methods. This dataset, the Corpus of Lake District Writing (CLDW), consists of 80 manually digitised and annotated texts (comprising over 1.5 million word tokens). These texts were originally composed between 1622 and 1900, and they represent a range of different genres and authors. Collectively, the texts in the CLDW constitute an indicative sample of writing about the English Lake District between the early seventeenth century and the early twentieth century. The corpus is annotated more deeply than is currently possible with vanilla Named Entity Recognition (NER), disambiguation and geoparsing. This is especially true of the geographical information the corpus contains, since we have undertaken not only to link different historical and spelling variants of place-names, but also to identify and differentiate geographical features such as waterfalls, woodlands, farms and inns. In addition, we illustrate the potential of the corpus as a gold standard by evaluating the output of three different NLP libraries and geoparsers on its contents. In this evaluation, standard NER processing of the texts by each library produces many false positives and false negatives, which underlines the value of the corpus as a gold standard.
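As a rough illustration of what counting false positives and false negatives against a gold standard involves, the following minimal sketch scores a set of predicted place-name spans against gold annotations. It assumes exact matching on character offsets and label; the `Span` type, the `evaluate` function, the `PLACE` label and the sample offsets are all hypothetical and are not taken from the CLDW or from the evaluation reported in the paper, which may use a different matching scheme.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical annotation: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def evaluate(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Score predictions against gold spans using exact-match comparison."""
    tp = len(gold & predicted)   # spans found by the tool and present in the gold standard
    fp = len(predicted - gold)   # spans the tool produced that the gold standard lacks
    fn = len(gold - predicted)   # gold spans the tool missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Invented example data: three gold place-names, three predictions.
gold = {Span(0, 9, "PLACE"), Span(25, 35, "PLACE"), Span(50, 58, "PLACE")}
predicted = {Span(0, 9, "PLACE"), Span(25, 35, "PLACE"), Span(70, 76, "PLACE")}

scores = evaluate(gold, predicted)
print(scores)
```

Exact matching is the strictest choice: a prediction that overlaps a gold span but differs by one character counts as both a false positive and a false negative. A relaxed, overlap-based comparison would be a reasonable alternative for historical spelling variants.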
