Automatically building large-scale named entity recognition corpora from Chinese Wikipedia

Named entity recognition (NER) is a core component of many natural language processing applications. Most NER systems rely on supervised machine learning, which requires time-consuming and expensive annotation for each language and domain. This paper presents a method for automatically building silver-standard NER corpora from Chinese Wikipedia. We refine novel, language-dependent features by exploiting the text and structure of Chinese Wikipedia. To reduce tagging errors introduced during entity classification, we design four types of heuristic rules based on the characteristics of Chinese Wikipedia and train a supervised NE classifier; combining the two improves both precision and coverage. We then identify the types of implicit mentions by using the boundary information of outgoing links. Selecting sentences related to the domains of the test data allows us to train better NER models. In our experiments, we build large-scale NER corpora containing 2.3 million sentences from Chinese Wikipedia. The results demonstrate the effectiveness of the automatically annotated corpora, and the trained NER models achieve the best performance when our silver-standard corpora are combined with gold-standard corpora.
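
The general idea of projecting Wikipedia links onto NE tags can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: links use the `[[Target|anchor]]` wikitext form, a hypothetical `title_types` dictionary stands in for the output of the rule-based and supervised entity classification, and implicit (unlinked) mentions are found by plain string matching against a hypothetical `known_mentions` dictionary rather than by the boundary-based procedure the paper describes. Tags are assigned per character in BIO format.

```python
import re

# Matches [[Target]] and [[Target|anchor]] (simplified wikitext, an assumption).
LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def parse_links(wikitext):
    """Strip link markup; return plain text and (start, end, target) spans."""
    parts, spans = [], []
    pos = 0       # read position in the wikitext
    out_len = 0   # length of the plain text produced so far
    for m in LINK.finditer(wikitext):
        before = wikitext[pos:m.start()]
        parts.append(before)
        out_len += len(before)
        target = m.group(1)
        anchor = m.group(2) or target
        spans.append((out_len, out_len + len(anchor), target))
        parts.append(anchor)
        out_len += len(anchor)
        pos = m.end()
    parts.append(wikitext[pos:])
    return "".join(parts), spans


def tag_sentence(wikitext, title_types, known_mentions=None):
    """Return (plain_text, per-character BIO tags).

    title_types:    hypothetical map from article title to NE type.
    known_mentions: hypothetical map from surface string to NE type,
                    used here as a stand-in for implicit-mention typing.
    """
    text, spans = parse_links(wikitext)
    tags = ["O"] * len(text)

    def mark(start, end, ne_type):
        # Never overwrite a span that is already tagged.
        if any(t != "O" for t in tags[start:end]):
            return
        tags[start] = "B-" + ne_type
        for i in range(start + 1, end):
            tags[i] = "I-" + ne_type

    # Explicit mentions: anchors of outgoing links with a known target type.
    for start, end, target in spans:
        ne_type = title_types.get(target)
        if ne_type:
            mark(start, end, ne_type)

    # Implicit mentions: unlinked occurrences of known surface strings,
    # longest first so longer names take priority over their substrings.
    mentions = sorted((known_mentions or {}).items(), key=lambda kv: -len(kv[0]))
    for mention, ne_type in mentions:
        if not mention:
            continue
        start = text.find(mention)
        while start != -1:
            mark(start, start + len(mention), ne_type)
            start = text.find(mention, start + 1)

    return text, tags
```

In this sketch a linked anchor is tagged only when its target article has a known type, so untyped links fall back to `O` instead of receiving a guessed label. That is a conservative choice made for the illustration, not a claim about how the paper handles such links.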
