Adaptive context features for toponym resolution in streaming news

News sources around the world generate constant streams of information, but effective retrieval of streaming news requires an understanding of the geographic content of each article. This process of understanding, known as geotagging, consists of two steps: first, finding words in article text that correspond to location names (toponyms), and second, assigning each toponym its correct latitude/longitude values. The latter step, called toponym resolution, can also be considered a classification problem, in which each possible interpretation of each toponym is classified as correct or incorrect. Hence, techniques from supervised machine learning can be applied to improve accuracy. New classification features for toponym resolution, termed adaptive context features, are introduced; they consider a window of context around each toponym and use geographic attributes of the toponyms in that window to aid in correct resolution. Adaptive parameters controlling the window's breadth and depth afford flexibility in managing the tradeoff between feature computation speed and resolution accuracy, allowing the features to potentially apply to a variety of textual domains. Extensive experiments with three large datasets of streaming news demonstrate the new features' effectiveness over two widely used competing methods.
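To make the idea of a context window concrete, the following is a minimal sketch of one possible window-based proximity feature, not the features defined in the paper. It assumes that breadth means the number of neighboring toponyms considered on each side of the toponym being resolved, and that depth means the maximum number of candidate interpretations examined per neighbor; the abstract does not spell out either definition. The names `Interpretation`, `haversine_km`, and `proximity_features`, and the sample coordinates, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interpretation:
    """One candidate reading of a toponym (hypothetical structure)."""
    name: str
    lat: float
    lon: float


def haversine_km(a: Interpretation, b: Interpretation) -> float:
    """Great-circle distance in kilometers between two interpretations."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))


def proximity_features(candidates, breadth, depth):
    """Score every interpretation by its closeness to nearby toponyms.

    candidates[i] lists the interpretations of the i-th toponym in document
    order. For interpretation j of toponym i, the feature is the mean, over
    the neighbors in the window, of the distance to that neighbor's nearest
    interpretation among its first `depth` candidates (assumed definitions).
    """
    features = {}
    for i, interps in enumerate(candidates):
        lo = max(0, i - breadth)
        hi = min(len(candidates), i + breadth + 1)
        # Truncate each neighbor's candidate list to `depth` entries.
        neighbors = [candidates[k][:depth] for k in range(lo, hi) if k != i]
        neighbors = [n for n in neighbors if n]
        for j, interp in enumerate(interps):
            dists = [min(haversine_km(interp, other) for other in n) for n in neighbors]
            features[(i, j)] = sum(dists) / len(dists) if dists else float("inf")
    return features


# Illustrative input: "Paris" followed by "Dallas" in the same article.
paris_fr = Interpretation("Paris, France", 48.8566, 2.3522)
paris_tx = Interpretation("Paris, Texas", 33.6609, -95.5555)
dallas_tx = Interpretation("Dallas, Texas", 32.7767, -96.7970)

candidates = [[paris_fr, paris_tx], [dallas_tx]]
features = proximity_features(candidates, breadth=1, depth=2)

# The Texas reading of "Paris" lies far closer to the neighboring toponym.
print(features[(0, 1)] < features[(0, 0)])
```

Under these assumptions the number of distance computations per interpretation grows roughly with breadth times depth, which is one way to picture the speed versus accuracy tradeoff the abstract describes. In a supervised setting, values like these would be passed to a classifier alongside other features rather than used as a decision rule on their own.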
