Toward Geographic Information Harvesting: Extraction of Spatial Relational Facts from Web Documents

This paper addresses the problem of harvesting geographic information from Web documents, specifically the extraction of facts about spatial relations among geographic places. The motivation is twofold. First, research in Spatial Data Mining often assumes that spatial data are already available, thanks to current GIS and positioning technologies. This assumption does not hold for spatial information embedded in data that lack an explicit spatial model, such as documents. Second, despite the huge number of Web documents conveying useful geographic information, little work has been done on how to harvest spatial data from them. The problem is particularly challenging because of the lack of annotated documents, which prevents the application of supervised learning techniques. We propose an unsupervised approach that recognizes spatial relations among geographic places without requiring annotated documents. The approach combines a spatial ontology with a prototype-based classifier. A case study on topological and directional relations is reported and discussed.
