Named Entity Recognition in Wikipedia

Named entity recognition (NER) is used in many domains beyond the newswire text that makes up current gold-standard corpora. Recent work has used Wikipedia's link structure to automatically generate near-gold-standard annotations. Until now, these resources have been evaluated only on newswire corpora or on themselves. We present the first NER evaluation on a Wikipedia gold-standard (WG) corpus. Our analysis of cross-corpus performance on WG shows that Wikipedia text may be a harder NER domain than newswire. We find that an automatic annotation of Wikipedia has high agreement with WG and, when used as training data, outperforms newswire models by up to 7.7%.
