Bootstrapping Multilingual Geographical Gazetteers from Corpora

This paper presents an approach to automatically generating multilingual gazetteers of geographical names via two bootstrapping loops over corpora in different languages. First, a small seed list of geographical names is matched against an unannotated corpus in one language to generate training data for a memory-based classifier, and memory-based learning is then applied to extend the gazetteer. Next, a cross-over to a second language is made by matching the extended gazetteer against a corpus in that language. Training data for a classifier is generated again, and the bootstrapping process is repeated to extend the gazetteer further. The process resembles co-training, in which information from other sources is introduced to improve classification. To estimate how much the final gazetteer differs from the initial seed list, and thereby to evaluate the algorithm, both lists were matched against three datasets with manually annotated geographical entities.
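The two loops described above can be illustrated with a minimal sketch. It is not the authors' implementation. A 1-nearest-neighbour rule with an overlap metric stands in for the memory-based learner, and the features (previous and next word), the heuristic that lowercase tokens serve as negative examples, the function names, and the toy sentences are all assumptions made for illustration.

```python
def context(tokens, i):
    """Hypothetical feature vector: lowercased previous and next token."""
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return (prev_tok, next_tok)


def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


def bootstrap(sentences, gazetteer, max_rounds=5):
    """Extend a gazetteer by matching it to a corpus and classifying candidates."""
    gazetteer = set(gazetteer)
    for _ in range(max_rounds):
        positives, negatives, candidates = [], [], []
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                feats = context(tokens, i)
                if tok in gazetteer:
                    positives.append(feats)          # matched gazetteer entry
                elif tok[0].isupper():
                    candidates.append((tok, feats))  # unknown capitalized token
                else:
                    negatives.append(feats)          # assumed non-name
        new_names = set()
        for tok, feats in candidates:
            # 1-NN decision: accept only if the closest stored positive
            # instance is strictly closer than the closest negative one.
            pos = max((overlap(feats, p) for p in positives), default=0)
            neg = max((overlap(feats, n) for n in negatives), default=0)
            if pos > neg:
                new_names.add(tok)
        if not new_names:
            break
        gazetteer |= new_names
    return gazetteer


# Toy corpora (invented for this sketch).
english = [s.split() for s in (
    "He lives in Paris with his family",
    "She lives in Berlin with her dog",
    "They work in Madrid with friends",
)]
spanish = [s.split() for s in (
    "Vive en Madrid con su familia",
    "Trabaja en Sevilla con amigos",
)]

extended = bootstrap(english, {"Paris"})   # first loop, first language
final = bootstrap(spanish, extended)       # cross-over to second language
print(sorted(final))  # ['Berlin', 'Madrid', 'Paris', 'Sevilla']
```

In this toy run the seed list alone would match nothing in the Spanish sentences. The second loop only starts because the first loop added "Madrid", which is the role the cross-over plays in the description above.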
