Extracting Geospatial Entities from Wikipedia

This paper addresses the challenge of extracting geospatial data from the article text of the English Wikipedia. In the first phase of our work, we create a training corpus and select a set of word-based features to train a Support Vector Machine (SVM) for geospatial named entity recognition. For testing, we target a corpus of Wikipedia articles about battles and wars, as these have a high incidence of geospatial content. The SVM recognizes place names in the corpus with very high recall, close to 100%, and acceptable precision. The set of geospatial named entities (NEs) is then fed into a geocoding and resolution process, whose goal is to determine the correct coordinates for each place name. Because many place names are ambiguous and do not immediately geocode to a single location, we present a data structure and algorithm that resolve ambiguity based on sentence and article context, so that the correct coordinates can be selected. We achieve an F-measure of 82% and create a set of geospatial entities for each article, combining the place names, spatial locations, and an assumed point geometry. These entities can enable geospatial search over, and geovisualization of, Wikipedia.
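To make the idea of context-based resolution concrete, the following is a minimal sketch, not the data structure or algorithm presented in the paper. It assumes a simple proximity heuristic: among the candidate coordinates for an ambiguous place name, pick the one closest to places already resolved in the same sentence or article. The function names `haversine_km` and `resolve`, the candidate format, and the approximate example coordinates are all illustrative assumptions.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, used by the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def resolve(candidates, context_points):
    """Pick the candidate nearest to the already-resolved context places.

    candidates: list of (label, (lat, lon)) for one ambiguous place name.
    context_points: list of (lat, lon) for unambiguous places nearby in the text.
    Assumption: with no context, fall back to the first candidate.
    """
    if not context_points:
        return candidates[0]
    return min(
        candidates,
        key=lambda c: sum(haversine_km(c[1], p) for p in context_points),
    )


# Hypothetical example: "Springfield" mentioned in a sentence that also names Chicago.
springfields = [
    ("Springfield, Massachusetts", (42.10, -72.59)),
    ("Springfield, Illinois", (39.80, -89.64)),
]
chicago = (41.88, -87.63)
best = resolve(springfields, [chicago])
print(best[0])
```

A proximity heuristic like this is only one possible use of context; the approach summarized above combines sentence-level and article-level evidence, which this sketch collapses into a single list of context points.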
