Adaptive Geoparsing Method for Toponym Recognition and Resolution in Unstructured Text

The automatic extraction of geospatial information is an important aspect of data mining. Discovering geographic information in natural language text requires a complex process called geoparsing, which comprises two main tasks: geographic entity recognition and toponym resolution. The first task can be addressed with machine learning, by training a model to recognize the sequences of characters (words) that correspond to geographic entities. The second task consists of assigning those entities to their most likely coordinates, which frequently involves resolving referential ambiguities. In this paper, we propose an extensible geoparsing approach that combines geographic entity recognition based on a neural network model with a disambiguation step that we call dynamic context disambiguation. Once place names are recognized in an input text, they are resolved using a grammar whose rules specify how ambiguities can be resolved from the context, much as a person would do. The result is an assignment of the most likely geographic properties to each recognized place. We propose an assessment measure based on a ranking of closeness between the predicted and actual locations of a place name. On this measure, our method outperforms OpenStreetMap Nominatim. We also include measures that assess place name recognition and the prediction of what we call geographic levels (the administrative jurisdiction of places).
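To make the idea of resolving a place name from its context concrete, the following minimal sketch shows one simple heuristic: an ambiguous name is assigned to the candidate that lies closest to the unambiguous places mentioned in the same text. This is an illustrative assumption, not the grammar-based dynamic context disambiguation described above. The toy gazetteer, its approximate coordinates, and the names `haversine_km` and `resolve` are all hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# Hypothetical toy gazetteer: place name -> list of (label, (lat, lon)) candidates.
# Coordinates are approximate and serve only as an illustration.
GAZETTEER = {
    "Paris": [
        ("Paris, France", (48.8566, 2.3522)),
        ("Paris, Texas, USA", (33.6609, -95.5555)),
    ],
    "Dallas": [("Dallas, Texas, USA", (32.7767, -96.7970))],
}


def resolve(toponyms, gazetteer=GAZETTEER):
    """Resolve each recognized place name to one gazetteer candidate.

    Names with a single candidate act as context anchors. An ambiguous name
    takes the candidate with the smallest total distance to those anchors.
    Without anchors, the first listed candidate is used as a default.
    """
    anchors = [gazetteer[t][0][1] for t in toponyms if len(gazetteer.get(t, [])) == 1]
    resolved = {}
    for name in toponyms:
        candidates = gazetteer.get(name, [])
        if not candidates:
            continue  # name not in the gazetteer: leave it unresolved
        if len(candidates) == 1 or not anchors:
            resolved[name] = candidates[0]
        else:
            resolved[name] = min(
                candidates,
                key=lambda c: sum(haversine_km(c[1], a) for a in anchors),
            )
    return resolved


if __name__ == "__main__":
    # "Dallas" is unambiguous here, so it pulls "Paris" towards Texas.
    print(resolve(["Paris", "Dallas"]))
    # With no context, the default (first) candidate is kept.
    print(resolve(["Paris"]))
```

The same `haversine_km` helper could also serve as a building block for a closeness-based evaluation, by measuring how far a predicted location lies from the annotated one. The ranking-based measure proposed in the paper is not reproduced here.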
