Fine-grained Dutch named entity recognition

This paper describes the creation of a fine-grained named entity annotation scheme and corpus for Dutch, and reports experiments on the automatic recognition of main-type and subtype named entities. We give an overview of existing named entity annotation schemes and motivate our own, which distinguishes six main types (persons, organizations, locations, products, events and miscellaneous named entities) and adds finer-grained information on subtypes and metonymic usage. The scheme was applied to a one-million-word subset of the Dutch SoNaR reference corpus. The classifier for main-type named entities achieves a micro-averaged F-score of 84.91% and is publicly available, along with the corpus and annotations.
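For readers unfamiliar with the metric, the following is a minimal sketch of how a micro-averaged F-score is commonly computed: true positives, false positives and false negatives are pooled over all entity types before precision and recall are taken. The function name `micro_f1` and the counts in the usage example are hypothetical illustrations, not figures or code from the paper, and the authors' exact evaluation procedure may differ.

```python
from typing import Dict, Tuple

# Per-type counts as (true positives, false positives, false negatives).
Counts = Dict[str, Tuple[int, int, int]]


def micro_f1(counts: Counts) -> float:
    """Pool counts over all entity types, then compute F1 from the totals."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    if tp == 0:
        # No correct predictions: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two of the six main types (illustration only).
example = {"PER": (8, 2, 2), "ORG": (3, 1, 3)}
score = micro_f1(example)
```

Because the counts are pooled, frequent entity types weigh more heavily in a micro-averaged score than in a macro-averaged one, where each type's F-score contributes equally.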