Government Domain Named Entity Recognition for South African Languages

This paper describes the named entity language resources created as part of a development project for the South African languages. The development efforts focused on creating protocols and annotated data sets with at least 15,000 annotated named entity tokens for ten of the official South African languages. The description of the protocols and annotated data sets provides an overview of the problems encountered during annotation. Based on these annotated data sets, conditional random field (CRF) named entity recognition systems are developed that leverage existing linguistic resources. The newly created named entity recognisers are evaluated, achieving F-scores between 0.64 and 0.77, and an error analysis is performed to identify possible avenues for improving the quality of the systems.
