Learning multilingual named entity recognition from Wikipedia

We automatically create enormous, free and multilingual silver-standard training annotations for named entity recognition (ner) by exploiting the text and structure of Wikipedia. Most ner systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (ne) types, training and evaluating on 7200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into ne annotations by projecting the target [email protected]?s classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against conll shared task data and other gold-standard corpora. Our approach outperforms other approaches to automatic ne annotation (Richman and Schone, 2008 [61], Mika et al., 2008 [46]) competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.

[1]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL 2006.

[2]  Robert E. Frederking,et al.  SYNERGY: A Named Entity Recognition System for Resource-scarce Languages such as Swahili using Online Machine Translation , 2010 .

[3]  Ee-Peng Lim,et al.  Measuring article quality in wikipedia: models and evaluation , 2007, CIKM '07.

[4]  Nigel Collier,et al.  Comparison between Tagged Corpora for the Named Entity Task , 2000, ACL 2000.

[5]  Ting Liu,et al.  Generating Chinese Named Entity Data from a Parallel Corpus , 2011, IJCNLP.

[6]  Alexander Ulanov,et al.  Classifying Wikipedia entities into fine-grained classes , 2011, 2011 IEEE 27th International Conference on Data Engineering Workshops.

[7]  Marc Moens,et al.  Named Entity Recognition without Gazetteers , 1999, EACL.

[8]  Fabian M. Suchanek,et al.  Yago: A Core of Semantic Knowledge Unifying WordNet and Wikipedia , 2007 .

[9]  Razvan C. Bunescu,et al.  Using Encyclopedic Knowledge for Named entity Disambiguation , 2006, EACL.

[10]  Jeremy H. Clear,et al.  The British national corpus , 1993 .

[11]  Alexander Schill,et al.  Training a Named Entity Recognizer on the Web , 2011, WISE.

[12]  Joel Nothman,et al.  Analysing Wikipedia and Gold-Standard Corpora for NER Training , 2009, EACL.

[13]  Antonio Toral,et al.  A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia , 2006, Workshop On New Text Wikis And Blogs And Other Dynamic Text Sources.

[14]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[15]  Joel Nothman,et al.  Named Entity Recognition in Wikipedia , 2009, PWNLP@IJCNLP.

[16]  Eraldo Rezende Fernandes,et al.  Learning from Partially Annotated Sequences , 2011, ECML/PKDD.

[17]  Dan Roth,et al.  Design Challenges and Misconceptions in Named Entity Recognition , 2009, CoNLL.

[18]  Doug Downey,et al.  Unsupervised named-entity extraction from the Web: An experimental study , 2005, Artif. Intell..

[19]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[20]  Carina Silberer,et al.  Building a Multilingual Lexical Resource for Named Entity Disambiguation, Translation and Transliteration , 2008, LREC.

[21]  Satoshi Sekine,et al.  Extended Named Entity Hierarchy , 2002, LREC.

[22]  Ralf Steinberger,et al.  Building a Multilingual Named Entity-Annotated Corpus Using Annotation Projection , 2011, RANLP.

[23]  Antonio Toral,et al.  Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon , 2011, Language Resources and Evaluation.

[24]  Tibor Kiss,et al.  Unsupervised Multilingual Sentence Boundary Detection , 2006, CL.

[25]  Joel Nothman,et al.  Learning Named Entity Recognition from Wikipedia , 2008 .

[26]  Olena Medelyan,et al.  Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense , 2008, AAAI 2008.

[27]  Gerhard Weikum,et al.  YAGO: A Large Ontology from Wikipedia and WordNet , 2008, J. Web Semant..

[28]  Wisam Dakka,et al.  Augmenting Wikipedia with Named Entity Tags , 2008, IJCNLP.

[29]  Nancy Chinchor,et al.  The Multilingual Entity Task (MET) Overview , 1996, TIPSTER.

[30]  Nan Ye,et al.  Domain adaptive bootstrapping for named entity recognition , 2009, EMNLP.

[31]  Silviu Cucerzan,et al.  Large-Scale Named Entity Disambiguation Based on Wikipedia Data , 2007, EMNLP.

[32]  Maarten de Rijke,et al.  Finding Similar Sentences across Multiple Languages in Wikipedia , 2006 .

[33]  Gary Geunbae Lee,et al.  Automatic Acquisition of Named Entity Tagged Corpus from World Wide Web , 2003, ACL.

[34]  Joel Nothman,et al.  Classifying articles in English and German Wikipedia , 2009, ALTA.

[35]  Nancy A. Chinchor,et al.  Overview of MUC-7 , 1998, MUC.

[36]  Meliha Yetisgen-Yildiz,et al.  Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk , 2010, Mturk@HLT-NAACL.

[37]  Rada Mihalcea,et al.  Wikify!: linking documents to encyclopedic knowledge , 2007, CIKM '07.

[38]  Peter Mika,et al.  Learning to Tag and Tagging to Learn: A Case Study on Wikipedia , 2008, IEEE Intelligent Systems.

[39]  Kentaro Torisawa,et al.  Exploiting Wikipedia as External Knowledge for Named Entity Recognition , 2007, EMNLP.

[40]  Y. Altun,et al.  Named-Entity Recognition in Novel Domains with External Lexical Knowledge , 2005 .

[41]  S. T. Buckland,et al.  Computer-Intensive Methods for Testing Hypotheses. , 1990 .

[42]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[43]  Julia Hirschberg,et al.  An Unsupervised Approach to Biography Production Using Wikipedia , 2008, ACL.

[44]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition , 2002, CoNLL.

[45]  Helmut Schmidt,et al.  Probabilistic part-of-speech tagging using decision trees , 1994 .

[46]  Hwee Tou Ng,et al.  One Class per Named Entity: Exploiting Unlabeled Text for Named Entity Recognition , 2007, IJCAI.

[47]  Simone Paolo Ponzetto,et al.  BabelNet: Building a Very Large Multilingual Semantic Network , 2010, ACL.

[48]  Elena Filatova Multilingual Wikipedia , Summarization , and Information Trustworthiness , 2009 .

[49]  Antonio Toral,et al.  Applying Wikipedia's Multilingual Knowledge to Cross-Lingual Question Answering , 2007, NLDB.

[50]  James R. Curran,et al.  Improved Text Categorisation for Wikipedia Named Entities , 2009, ALTA.

[51]  Gerhard Weikum,et al.  WWW 2007 / Track: Semantic Web Session: Ontologies ABSTRACT YAGO: A Core of Semantic Knowledge , 2022 .

[52]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[53]  Gosse Bouma,et al.  Cross-lingual Alignment and Completion of Wikipedia Templates , 2009 .

[54]  Will Fitzgerald,et al.  A Hybrid Model for Annotating Named Entity Training Corpora , 2010, Linguistic Annotation Workshop.

[55]  Michael Strube,et al.  WikiNet: A Very Large Scale Multi-Lingual Concept Network , 2010, LREC.

[56]  Michael Skinner,et al.  Information arbitrage across multi-lingual Wikipedia , 2009, WSDM '09.

[57]  Kentaro Torisawa,et al.  Inducing Gazetteers for Named Entity Recognition by Large-Scale Clustering of Dependency Relations , 2008, ACL.

[58]  Xiaoyi Ma Toward a Name Entity Aligned Bilingual Corpus , 2010 .

[59]  Sriharsha Veeramachaneni,et al.  A Simple Semi-supervised Algorithm For Named Entity Recognition , 2009, HLT-NAACL 2009.

[60]  Steven Bird,et al.  NLTK: The Natural Language Toolkit , 2002, ACL.

[61]  Maria Ruiz-Casado,et al.  Automatic Assignment of Wikipedia Encyclopedic Entries to WordNet Synsets , 2005, AWIC.

[62]  Jun Suzuki,et al.  Semi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data , 2008, ACL.

[63]  Yuji Matsumoto,et al.  A Graph-Based Approach to Named Entity Categorization in Wikipedia Using Conditional Random Fields , 2007, EMNLP.

[64]  Richard M. Schwartz,et al.  An Algorithm that Learns What's in a Name , 1999, Machine Learning.

[65]  Joel Nothman,et al.  Transforming Wikipedia into Named Entity Training Data , 2008, ALTA.

[66]  Tong Zhang,et al.  A High-Performance Semi-Supervised Learning Method for Text Chunking , 2005, ACL.

[67]  Ting Liu,et al.  Generating Chinese named entity data from parallel corpora , 2014, Frontiers of Computer Science.

[68]  Evgeniy Gabrilovich,et al.  Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge , 2006, AAAI.

[69]  Gerhard Weikum,et al.  MENTA: inducing multilingual taxonomies from wikipedia , 2010, CIKM '10.

[70]  Simone Paolo Ponzetto,et al.  WikiRelate! Computing Semantic Relatedness Using Wikipedia , 2006, AAAI.

[71]  Philipp Cimiano,et al.  Enriching the crosslingual link structure of Wikipedia - A classification-based approach , 2008, AAAI 2008.

[72]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[73]  Hinrich Schütze,et al.  Active Learning with Amazon Mechanical Turk , 2011, EMNLP.

[74]  Stan Matwin,et al.  Unsupervised Named-Entity Recognition: Generating Gazetteers and Resolving Ambiguity , 2006, Canadian AI.

[75]  András A. Benczúr,et al.  Cross-Language Retrieval with Wikipedia , 2008, CLEF.

[76]  Patrick Schone,et al.  Mining Wiki Resources for Multilingual Named Entity Recognition , 2008, ACL.

[77]  Walt Detmar Meurers,et al.  Detecting Errors in Part-of-Speech Annotation , 2003, EACL.

[78]  Benno Stein,et al.  A Wikipedia-Based Multilingual Retrieval Model , 2008, ECIR.

[79]  Manaal Faruqui,et al.  Training and Evaluating a German Named Entity Recognizer with Semantic Generalization , 2010, KONVENS.

[80]  Simone Paolo Ponzetto,et al.  Large-Scale Taxonomy Mapping for Restructuring and Integrating Wikipedia , 2009, IJCAI.

[81]  Péter Schönhofen,et al.  Identifying Document Topics Using the Wikipedia Category Network , 2006, 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2006 Main Conference Proceedings)(WI'06).

[82]  Daniel S. Weld,et al.  Autonomously semantifying wikipedia , 2007, CIKM '07.

[83]  David Yarowsky,et al.  Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora , 2001, HLT.