Toponym resolution in text: annotation, evaluation and applications of spatial grounding

In Information Extraction (IE), processing of named entities in text has traditionally been seen as a two-step process comprising a flat text span recognition sub-task and an atomic classification sub-task; relating the text span to a model of the world has been ignored by evaluations such as DARPA/NIST's MUC or ACE. However, spatial and temporal expressions refer to events in space-time, and the grounding of events is a precondition for accurate reasoning. Thus, automatic grounding can improve many applications such as automatic map drawing (e.g., for choosing a focus) and question answering (e.g., for questions like "How far is London from Edinburgh?", given a story in which both occur and can be resolved). Whereas temporal grounding has received considerable attention in the recent past [2, 3], robust spatial grounding has long been neglected. Concentrating on geographic names for populated places, I define the task of automatic Toponym Resolution (TR) as computing the mapping from occurrences of names for places as found in a text to a representation of the extensional semantics of the location referred to (its referent), such as a geographic latitude/longitude footprint. The task of mapping from names to locations is hard due to insufficient and noisy databases and a large degree of ambiguity: common words need to be distinguished from proper names (geo/non-geo ambiguity), and the mapping between names and locations is ambiguous (London can refer to the capital of the UK, to London, Ontario, Canada, or to about forty other Londons on earth). In addition, names of places and the boundaries referred to change over time, and databases are incomplete.
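To make the task definition concrete, the following is a minimal Python sketch of resolving ambiguous toponyms to latitude/longitude referents and then answering a distance question. It is an illustration under stated assumptions, not the method evaluated in this work: the tiny `GAZETTEER`, its approximate coordinates, and the function names `haversine_km` and `resolve` are all hypothetical, and the "pick the geographically closest combination of candidates" rule is just one simple heuristic chosen for the example.

```python
import itertools
import math

# Hypothetical toy gazetteer: name -> candidate referents (label, lat, lon).
# Coordinates are approximate and for illustration only.
GAZETTEER = {
    "London": [
        ("London, England, UK", 51.5074, -0.1278),
        ("London, Ontario, Canada", 42.9849, -81.2453),
    ],
    "Edinburgh": [
        ("Edinburgh, Scotland, UK", 55.9533, -3.1883),
    ],
}


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km on a spherical Earth (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def resolve(toponyms, gazetteer=GAZETTEER):
    """Assign one referent per distinct toponym.

    Assumed heuristic: among all combinations of candidates, choose the one
    with the smallest total pairwise distance. Names missing from the
    gazetteer are skipped, mirroring the incomplete-database problem.
    """
    names = [t for t in dict.fromkeys(toponyms) if t in gazetteer]
    best, best_cost = None, math.inf
    for combo in itertools.product(*(gazetteer[n] for n in names)):
        cost = sum(
            haversine_km(a[1], a[2], b[1], b[2])
            for a, b in itertools.combinations(combo, 2)
        )
        if cost < best_cost:
            best, best_cost = combo, cost
    return dict(zip(names, best)) if best else {}


if __name__ == "__main__":
    resolved = resolve(["London", "Edinburgh", "London"])
    for name, (label, lat, lon) in resolved.items():
        print(f"{name} -> {label} ({lat}, {lon})")
    london, edinburgh = resolved["London"], resolved["Edinburgh"]
    km = haversine_km(london[1], london[2], edinburgh[1], edinburgh[2])
    print(f"How far is London from Edinburgh? About {km:.0f} km")
```

The brute-force search over candidate combinations is exponential in the number of ambiguous names, so it only suits a toy example; it is used here because it keeps the link between ambiguity, grounding, and the distance question easy to follow.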

[1]  Y. Tuan,et al.  Space and Place: The Perspective of Experience , 1978 .

[2]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[3]  Herbert A. Simon,et al.  Why a Diagram is (Sometimes) Worth Ten Thousand Words , 1987, Cogn. Sci..

[4]  Soteria Svorou,et al.  The grammar of space , 1994 .

[5]  Nancy A. Chinchor,et al.  Overview of MUC-7 , 1998, MUC.

[6]  Stefan Evert,et al.  The NITE XML Toolkit: Flexible annotation for multimodal language data , 2003, Behavior research methods, instruments, & computers : a journal of the Psychonomic Society, Inc.

[7]  Linda L. Hill,et al.  Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints , 2000, ECDL.

[8]  Walter L. Smith Probability and Statistics , 1959, Nature.

[9]  Linda L. Hill Access to geographic concepts in online bibliographic files: effectiveness of current practices and the potential of a graphic interface , 1990 .

[10]  Mark Sanderson,et al.  Word sense disambiguation and information retrieval , 1994, SIGIR '94.

[11]  Adam Kilgarriff,et al.  Framework and Results for English SENSEVAL , 2000, Comput. Humanit..

[12]  Siobhan Chapman Logic and Conversation , 2005 .

[13]  Roy T. Fielding,et al.  Hypertext Transfer Protocol - HTTP/1.1 , 1997, RFC.

[14]  Olga Uryupina,et al.  Semi-supervised learning of geographical gazetteer from the internet , 2003, Workshop On Analysis Of Geographic References.

[15]  Christiane Fellbaum,et al.  Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[16]  Steffen Staab,et al.  Ontology Learning for the Semantic Web , 2002, IEEE Intell. Syst..

[17]  Gideon S. Mann,et al.  Bootstrapping toponym classifiers , 2003, HLT-NAACL 2003.

[18]  Yoko Nishimura,et al.  Google Earth , 2008, Encyclopedia of GIS.

[19]  Anthony G. Cohn,et al.  Qualitative Spatial Representation and Reasoning with the Region Connection Calculus , 1997, GeoInformatica.

[20]  Shigeo Abe Pattern Classification , 2001, Springer London.

[21]  Rohini K. Srihari,et al.  A Hybrid Approach for Named Entity and Sub-Type Tagging , 2000, ANLP.

[22]  Markus Neteler,et al.  Open Source GIS: A GRASS GIS Approach , 2007 .

[23]  Zhong-ren Peng,et al.  Internet GIS: Distributed Geographic Information Services for the Internet and Wireless Networks , 2003 .

[24]  Patrice Enjalbert,et al.  Geographic reference analysis for geographic document querying , 2003, HLT-NAACL 2003.

[25]  Marilyn Eileen Jessen A semantic study of spatial and temporal expressions in English , 1974 .

[26]  Marc Moens,et al.  Named Entity Recognition without Gazetteers , 1999, EACL.

[27]  Ivar Jacobson,et al.  The Unified Software Development Process , 1999 .

[28]  Hinrich Schütze,et al.  Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[29]  Nancy Chinchor,et al.  Overview of MUC-7 , 1998, MUC.

[30]  Schuyler Erle,et al.  Mapping hacks: tips & tools for electronic cartography , 2005 .

[31]  Alexander G. Hauptmann,et al.  Using Location Information from Speech Recognition of Television News Broadcasts , 1999 .

[32]  Ralph Grishman,et al.  NYU: Description of the MENE Named Entity System as Used in MUC-7 , 1998, MUC.

[33]  J. Kruskal On the shortest spanning subtree of a graph and the traveling salesman problem , 1956 .

[34]  Ellen M. Voorhees,et al.  Overview of TREC 2004 , 2004, TREC.

[35]  Yiming Yang,et al.  RCV1: A New Benchmark Collection for Text Categorization Research , 2004, J. Mach. Learn. Res..

[36]  Ellen M. Voorhees,et al.  The TREC-8 Question Answering Track Evaluation , 2000, TREC.

[37]  Howard D. Wactlar,et al.  Complementary video and audio analysis for broadcast news archives , 2000, CACM.

[38]  Erik Rauch,et al.  A confidence-based framework for disambiguating geographic terms , 2003, HLT-NAACL 2003.

[39]  Cheng Niu,et al.  InfoXtract location normalization: a hybrid approach to geographic references in information extraction , 2003, HLT-NAACL 2003.

[40]  Jochen L. Leidner Current Issues in Software Engineering for Natural Language Processing , 2003, HLT-NAACL 2003.

[41]  Jochen L. Leidner Toponym Resolution in Text: “Which Sheffield is it?” , 2004 .

[42]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[43]  James Pustejovsky,et al.  Annotation of Temporal and Event Expressions , 2003, HLT-NAACL.

[44]  Klaus Krippendorff,et al.  Content Analysis: An Introduction to Its Methodology , 1980 .

[45]  Gregory R. Crane,et al.  Disambiguating Geographic Names in a Historical Digital Library , 2001, ECDL.

[46]  Naomi Sager,et al.  Natural Language Information Processing: A Computer Grammar of English and Its Applications , 1980 .

[47]  Allison Woodruff,et al.  The Sequoia 2000 Electronic Repository , 1995, Digit. Tech. J..

[48]  Breck Baldwin,et al.  Cross-Document Event Coreference: Annotations, Experiments, and Observations , 1999, COREF@ACL.

[49]  Sharon Oviatt,et al.  Multimodal interactive maps: designing for human performance , 1997 .

[50]  Dan Wu,et al.  On assigning place names to geography related web pages , 2005, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05).

[51]  Douglas E. Appelt,et al.  Deductive Question Answering from Multiple Resources , 2004, New Directions in Question Answering.

[52]  Andy Shaw AlertNet Webmap Initiative - New Media Approaches to Mapping Humanitarian Response , 2003 .

[53]  Andrew Tomkins,et al.  How to build a WebFountain: An architecture for very large-scale text analytics , 2004, IBM Syst. J..

[54]  Manfred K. Warmuth,et al.  The Weighted Majority Algorithm , 1994, Inf. Comput..

[55]  David Yarowsky,et al.  Desperately Seeking Cebuano , 2003, NAACL.

[56]  Jochen L. Leidner A wireless natural language search engine , 2005, SIGIR '05.

[57]  Anthony G. Cohn,et al.  A Spatial Logic based on Regions and Connection , 1992, KR.

[58]  Michael E. Lesk,et al.  Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone , 1986, SIGDOC '86.

[59]  Douglas E. Appelt,et al.  FASTUS: A Finite-state Processor for Information Extraction from Real-world Text , 1993, IJCAI.

[60]  Thomas D. Sandry,et al.  Introductory Statistics With R , 2003, Technometrics.

[61]  Malvina Nissim,et al.  Towards a Corpus Annotated for Metonymies: the Case of Location Names , 2002, LREC.

[62]  Stephen Potter,et al.  A Framework for Text Mining Services , 2004 .

[63]  James R. Curran,et al.  Parsing the WSJ Using CCG and Log-Linear Models , 2004, ACL.

[64]  Nancy Ide,et al.  Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art , 1998, Comput. Linguistics.

[65]  Mark Stevenson,et al.  The Reuters Corpus Volume 1 - from Yesterday’s News to Tomorrow’s Language Resources , 2002, LREC.

[66]  Patrick Oliver,et al.  Representation and Processing of Spatial Expressions , 1998 .

[67]  Marc Moens,et al.  LT TTT - A Flexible Tokenisation Tool , 2000, LREC.

[68]  Jerry R. Hobbs Overview of the TACITUS Project , 1986, HLT.

[69]  K. Sparck Jones,et al.  Simple, proven approaches to text retrieval , 1994 .

[70]  Fredric C. Gey,et al.  GeoCLEF: the CLEF 2005 Cross-Language Geographic Information Retrieval Track , 2005, CLEF.

[71]  Eric R. Ziegel,et al.  The Elements of Statistical Learning , 2003, Technometrics.

[72]  Ian Densham,et al.  System Demo: A geo-coding service encompassing a geo-parsing tool and integrated digital gazetteer service , 2003, HLT-NAACL 2003.

[73]  Lynette Hirschman,et al.  Natural language question answering: the view from here , 2001, Natural Language Engineering.

[74]  Maria T. Pazienza,et al.  Information Extraction , 2002, Lecture Notes in Computer Science.

[75]  A. V. Phillips,et al.  A Question-Answering Routine , 1960 .

[76]  Satoshi Sekine,et al.  Extended Named Entity Hierarchy , 2002, LREC.

[77]  Amanda Spink,et al.  How are we searching the World Wide Web? A comparison of nine search engine transaction logs , 2006, Inf. Process. Manag..

[78]  Mor Naaman,et al.  Assigning textual names to sets of geographic coordinates , 2006, Comput. Environ. Urban Syst..

[79]  Hanan Samet,et al.  The Quadtree and Related Hierarchical Data Structures , 1984, CSUR.

[80]  Jochen L. Leidner Toponym Resolution: A First Large-Scale Comparative Evaluation , 2006 .

[81]  Hans-Ulrich Krieger SDL—A Description Language for Building NLP Systems , 2003, HLT-NAACL 2003.

[82]  E. H. Hutten Semantics , 1953, The British Journal for the Philosophy of Science.

[83]  Bert F. Green,et al.  Baseball: an automatic question-answerer , 1961, IRE-AIEE-ACM '61 (Western).

[84]  Wolfgang Maass,et al.  Spatial Layout Identification and Incremental Descriptions , 1994 .

[85]  Anthony McEnery,et al.  Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech , 2003 .

[86]  R. Polikar,et al.  Dynamically weighted majority voting for incremental learning and comparison of three boosting based approaches , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[87]  James R. Curran,et al.  Investigating GIS and Smoothing for Maximum Entropy Taggers , 2003, EACL.

[88]  Otis Gospodnetic,et al.  Lucene in Action , 2004 .

[89]  Peter Steenkiste,et al.  A Hybrid Location Model with a Computable Location Identifier for Ubiquitous Computing , 2002, UbiComp.

[90]  Eduard H. Hovy,et al.  Fine Grained Classification of Named Entities , 2002, COLING.

[91]  Bruno Pouliquen,et al.  Geocoding Multilingual Texts: Recognition, Disambiguation and Visualisation , 2006, LREC.

[92]  Jochen L. Leidner Towards a Reference Corpus for Automatic Toponym Resolution Evaluation , 2004 .

[93]  Axel Küpper Location-based Services: Fundamentals and Operation , 2005 .

[94]  Richard Waldinger,et al.  Pointing to places in a deductive geospatial theory , 2003, HLT-NAACL 2003.

[95]  Allen Kent,et al.  Machine literature searching VIII. Operational criteria for designing information retrieval systems , 1955 .

[96]  Kenneth B. Sall XML family of specifications: a practical guide , 2002 .

[97]  Martha Palmer,et al.  The English all-words task , 2004, SENSEVAL@ACL.

[98]  David Yarowsky,et al.  Unsupervised Word Sense Disambiguation Rivaling Supervised Methods , 1995, ACL.

[99]  R. Sinnott Virtues of the Haversine , 1984 .

[100]  M. Sanderson,et al.  Analyzing geographic queries , 2004 .

[101]  M. Goodchild,et al.  Geographic Information Systems and Science (second edition) , 2001 .

[102]  Joseph O'Rourke,et al.  Computational Geometry in C , 1995 .

[103]  Allison Woodruff,et al.  GIPSY: Automated Geographic Indexing of Text Documents , 1994, J. Am. Soc. Inf. Sci..

[104]  S. Pollock Measures for the comparison of information retrieval systems , 1968 .

[105]  Bruno Pouliquen,et al.  Geographical information recognition and visualization in texts written in various languages , 2004, SAC '04.

[106]  Hamish Cunningham,et al.  Software architecture for language engineering , 2000 .

[107]  Paul Clough,et al.  A proposal for comparative evaluation of automatic annotation for geo-referenced documents , 2005 .

[108]  Ray R. Larson,et al.  Spatial Ranking Methods for Geographic Information Retrieval (GIR) in Digital Libraries , 2004, ECDL.

[109]  James Allan,et al.  Topic detection and tracking: event-based information organization , 2002 .

[110]  Ronald L. Rivest,et al.  Learning decision lists , 2004, Machine Learning.

[111]  Yannick Versley,et al.  Extracting spatial information: grounding, classifying and linking spatial expressions [Extended Abstract] , 2022 .

[112]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[113]  David Yarowsky,et al.  One Sense Per Discourse , 1992, HLT.

[114]  Bonnie L. Webber,et al.  Towards the Use of Automated Reasoning in Discourse Disambiguation , 2001, J. Log. Lang. Inf..

[115]  L. Tiina Sarjakoski,et al.  An Approach to Intelligent Maps: Context Awareness , 2003 .

[116]  B. Webber,et al.  Answer Comparison: Analysis of Relationships between Answers to ‘Where’-Questions , 2004 .

[117]  S. Siegel,et al.  Nonparametric Statistics for the Behavioral Sciences , 2022, The SAGE Encyclopedia of Research Design.

[118]  Andrea Setzer,et al.  Temporal information in newswire articles: an annotation scheme and corpus study , 2001 .

[119]  Grace Hui Yang,et al.  Structured use of external knowledge for event-based open domain question answering , 2003, SIGIR.

[120]  Ian H. Witten,et al.  How to Build a Digital Library , 2002 .

[121]  Jochen L. Leidner,et al.  Grounding spatial named entities for information extraction and question answering , 2003, HLT-NAACL 2003.

[122]  George Lakoff,et al.  Women, Fire, and Dangerous Things , 1987 .

[123]  Paul D. Clough Extracting metadata for spatially-aware information retrieval on the internet , 2005, GIR '05.

[124]  David Robinson,et al.  The WWW Common Gateway Interface Version 1.1 , 1996 .

[125]  Agnès Voisard,et al.  Spatial databases - with applications to GIS , 2002 .

[126]  Cheng Niu,et al.  Location Normalization for Information Extraction , 2002, COLING.

[127]  Ray R. Larson,et al.  Geographic information retrieval and spatial browsing , 1996 .

[128]  Ezra Black,et al.  An Experiment in Computational Discrimination of English Word Senses , 1988, IBM J. Res. Dev..

[129]  Kevin Humphreys,et al.  New Directions in Question Answering , 2006, Information Retrieval.

[130]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[131]  David Yarowsky,et al.  A method for disambiguating word senses in a large corpus , 1992, Comput. Humanit..

[132]  Stefan M. Rüger,et al.  Identifying and grounding descriptions of places , 2006, GIR.

[133]  Richard E. Korf,et al.  Depth-First Iterative-Deepening: An Optimal Admissible Tree Search , 1985, Artif. Intell..

[134]  not Cwi,et al.  XHTML™ 1.0 The Extensible HyperText Markup Language , 2002 .

[135]  Douglas E. Appelt,et al.  FASTUS: A System for Extracting Information from Natural-Language Text , 1992 .

[136]  Ellen Riloff,et al.  Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping , 1999, AAAI/IAAI.

[137]  Inderjeet Mani,et al.  Robust Temporal Processing of News , 2000, ACL.

[138]  R. Prim Shortest connection networks and some generalizations , 1957 .

[139]  C. J. van Rijsbergen,et al.  The use of hierarchic clustering in information retrieval , 1971, Inf. Storage Retr..

[140]  Harith Alani,et al.  Voronoi-based region approximation for geographical information retrieval with gazetteers , 2001, Int. J. Geogr. Inf. Sci..

[141]  Ron Sivan,et al.  Web-a-where: geotagging web content , 2004, SIGIR '04.

[142]  S. Levinson,et al.  Language and Space , 1996 .

[143]  Jochen L. Leidner An evaluation dataset for the toponym resolution task , 2006, Comput. Environ. Urban Syst..

[144]  Abdur Chowdhury,et al.  A picture of search , 2006, InfoScale '06.