A metadata geoparsing system for place name recognition and resolution in metadata records

This paper describes an approach for recognizing and resolving place names mentioned in the descriptive metadata records of typical digital libraries. Our approach exploits evidence provided by the structured attributes already present in the metadata records to support place name recognition and resolution, with the aim of achieving better results than can be obtained from the lexical evidence in the textual values of these attributes alone. In metadata records, lexical evidence is very often insufficient for this task, since short sentences and simple expressions predominate. Our implementation uses a dictionary-based technique for recognizing place names (with names provided by Geonames), and machine learning for reasoning over the evidence and choosing a resolution candidate. We evaluated the approach on data sets with a metadata schema rich in Dublin Core elements, using two evaluation methods. First, cross-validation showed that our solution achieves a very high precision of 0.99 at a recall of 0.55, or a recall of 0.79 at a precision of 0.86. Second, in a comparative evaluation against an existing commercial service, our solution performed better at every confidence level (p < 0.001).
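The two stages named in the abstract, dictionary-based recognition followed by evidence-based resolution, can be illustrated with a minimal sketch. This is not the authors' implementation: the gazetteer below is a hypothetical toy stand-in for Geonames, the `country` record attribute and the `prior` values are invented for illustration, and a hand-written scoring function replaces the machine-learned model described in the paper.

```python
import re

# Hypothetical toy gazetteer: lower-cased name -> candidate places.
# A real system would load names from Geonames; the ids and "prior"
# values here are illustrative assumptions, not real data.
GAZETTEER = {
    "lisbon": [
        {"id": "lisbon-pt", "country": "PT", "prior": 0.9},
        {"id": "lisbon-us", "country": "US", "prior": 0.1},
    ],
    "new york": [
        {"id": "new-york-us", "country": "US", "prior": 1.0},
    ],
}

# Longest gazetteer name, in tokens, bounds the lookup window.
MAX_NAME_TOKENS = max(len(name.split()) for name in GAZETTEER)


def recognize(text):
    """Dictionary-based recognition: greedy longest match over tokens.

    Returns a list of (start_token, end_token, name) tuples.
    """
    tokens = re.findall(r"\w+", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            name = " ".join(tokens[i:i + n])
            if name in GAZETTEER:
                matches.append((i, i + n, name))
                i += n
                break
        else:
            i += 1  # no gazetteer name starts at this token
    return matches


def resolve(name, record):
    """Choose the candidate best supported by structured evidence.

    The score is a hand-written heuristic standing in for a trained
    classifier: agreement with the record's (hypothetical) "country"
    attribute outweighs the candidate's prior.
    """
    def score(candidate):
        agrees = candidate["country"] == record.get("country")
        return (1.0 if agrees else 0.0) + 0.5 * candidate["prior"]

    return max(GAZETTEER[name], key=score)


record = {"title": "Map of Lisbon and surroundings", "country": "US"}
for _, _, name in recognize(record["title"]):
    print(name, "->", resolve(name, record)["id"])
```

In this sketch the title alone would favour the candidate with the higher prior, while the structured `country` attribute shifts the choice to the other candidate, which is the kind of non-lexical evidence the abstract argues for.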
