Cross-lingual geo-parsing for non-structured data

A geo-parser automatically identifies location words in a text. We have built a geo-parser specifically to find locations in unstructured Spanish text. Our novel geo-parser architecture combines the results of four parsers: a lexico-semantic Named Location Parser, a rule-based building parser, a rule-based street parser, and a trained Named Entity Parser. Each parser has different strengths: the Named Location Parser is strong in recall, the Named Entity Parser is strong in precision, and the building and street parsers find buildings and streets, which the other two are not designed to detect. To test the performance of our Spanish geo-parser, we ran Spanish text through it and compared the output with that of the same text machine translated into English and run through our English geo-parser. The Spanish geo-parser identified toponyms with an F1 of .796, and the English geo-parser identified toponyms with an F1 of .861 (despite errors introduced by translation from Spanish to English), compared to an F1 of .114 from a commercial off-the-shelf Spanish geo-parser. The results suggest that (1) geo-parsers should be built specifically for unstructured text, as our Spanish and English geo-parsers have been, and (2) location entities in Spanish text that has been machine translated into English remain robust to geo-parsing in English.
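As a rough illustration of the architecture and of the F1 evaluation described above, the following minimal Python sketch merges the outputs of several parsers and scores the result against a gold standard. The abstract does not say how the four parsers' results are combined, so the simple set union used here is an assumption. The two toy parsers (`gazetteer_parser`, `street_parser`), the tiny gazetteer, and the sample sentence are likewise hypothetical stand-ins, not the authors' components or data.

```python
import re
from typing import Callable, Iterable, Set, Tuple

# A parser maps a text to the set of location strings it finds.
Parser = Callable[[str], Set[str]]

# Hypothetical toy gazetteer; a real Named Location Parser would use a much larger resource.
GAZETTEER = {"Madrid", "Sevilla"}


def gazetteer_parser(text: str) -> Set[str]:
    """Toy stand-in for a lexicon-based parser: plain substring lookup."""
    return {name for name in GAZETTEER if name in text}


def street_parser(text: str) -> Set[str]:
    """Toy stand-in for a rule-based street parser: a street keyword plus a capitalized word."""
    return set(re.findall(r"\b(?:Calle|Avenida)\s+[A-ZÁÉÍÓÚÑ]\w+", text))


def combine_parsers(text: str, parsers: Iterable[Parser]) -> Set[str]:
    """Merge parser outputs by set union (an assumed combination strategy)."""
    found: Set[str] = set()
    for parse in parsers:
        found |= parse(text)
    return found


def precision_recall_f1(predicted: Set[str], gold: Set[str]) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 over exact-match location strings."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example sentence and gold annotations, for illustration only.
text = "Incendio en la Calle Mayor, cerca de la Plaza del Sol en Madrid."
gold = {"Calle Mayor", "Plaza del Sol", "Madrid"}

predicted = combine_parsers(text, [gazetteer_parser, street_parser])
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(sorted(predicted), round(precision, 3), round(recall, 3), round(f1, 3))
```

In this toy case neither parser is designed to find the plaza, so recall falls below precision. A union of complementary parsers raises recall at some cost in precision, which is one plausible reason to pair a high-recall parser with a high-precision one.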
