Evaluation of Named Entity Recognition in Dutch Online Criminal Complaints

Citizens can increasingly submit crime reports and criminal complaints online, especially for cyber- and internet-related crimes such as phishing and online trade fraud. Such user-submitted crime reports contain references to entities of interest, such as the complainant, the counterparty, the items being traded, and locations. Using named entity recognition (NER) algorithms, these entities can be identified and used in further information extraction and legal reasoning. This paper describes an evaluation of the de facto standard NER algorithm for Dutch on crime reports provided by the Dutch police. An analysis of confusion in entity type assignment and of recall errors is presented, along with suggestions for improving performance. Besides a traditional evaluation based on a manually created gold standard, an alternative assessment method is applied to allow for more efficient evaluation and error analysis. The paper concludes with a general discussion of the use of NER in information extraction.
