Dictionary-based Named Entity Recognition

An important task in information extraction is named entity recognition (NER), the recognition of named entities in natural language texts. A named entity is a phrase representing an item of a class. This work presents a dictionary-based NER framework. It uses multiple dictionaries that are freely available on the Web; a dictionary is a collection of phrases that describe named entities. The framework is composed of two stages: (1) detection of named entity candidates using dictionary lookups and (2) filtering of false positives based on a part-of-speech tagger. Dictionary lookups are performed using an efficient prefix-tree data structure. Optionally, additional filters using word-form-based evidence can be applied to increase the precision and recall of the recognition. Most existing approaches to NER use machine learning techniques. The main drawback of these systems is the manual effort needed to create labeled training data. Our dictionary-based recognizer does not need labeled text as training data. Furthermore, the framework can be applied to any language that is supported by a part-of-speech tagger. On German, our dictionary-based recognizer achieves up to 89.01% precision at 77.64% recall and an F1 score of 81.60%, improving on Stanford’s NER by five percentage points in precision, recall, and F1 score.
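
The two stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names (`PhraseTrie`, `find_candidates`, `filter_by_pos`), the tiny dictionary, the hand-written part-of-speech tags, and the filter rule (keep a candidate only if at least one of its tokens carries the proper-noun tag `NE`, as in the STTS tagset for German) are all assumptions made for illustration. A real system would load large Web dictionaries and obtain the tags from an actual part-of-speech tagger.

```python
from typing import Dict, List, Optional, Tuple

Candidate = Tuple[int, int, str]  # (start index, end index exclusive, class label)


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.label: Optional[str] = None  # set if a dictionary phrase ends here


class PhraseTrie:
    """Token-level prefix tree over dictionary phrases (hypothetical helper)."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, phrase: str, label: str) -> None:
        node = self.root
        for token in phrase.split():
            node = node.children.setdefault(token, TrieNode())
        node.label = label

    def longest_match(self, tokens: List[str], start: int) -> Optional[Tuple[int, str]]:
        """Return (end, label) of the longest phrase starting at `start`, if any."""
        node = self.root
        best: Optional[Tuple[int, str]] = None
        for i in range(start, len(tokens)):
            nxt = node.children.get(tokens[i])
            if nxt is None:
                break
            node = nxt
            if node.label is not None:
                best = (i + 1, node.label)
        return best


def find_candidates(tokens: List[str], trie: PhraseTrie) -> List[Candidate]:
    """Stage 1: scan left to right, taking the longest dictionary match each time."""
    candidates: List[Candidate] = []
    i = 0
    while i < len(tokens):
        match = trie.longest_match(tokens, i)
        if match is None:
            i += 1
        else:
            end, label = match
            candidates.append((i, end, label))
            i = end
    return candidates


def filter_by_pos(candidates: List[Candidate], pos_tags: List[str],
                  proper_noun_tag: str = "NE") -> List[Candidate]:
    """Stage 2 (assumed rule): drop candidates that contain no proper-noun token."""
    return [c for c in candidates
            if any(tag == proper_noun_tag for tag in pos_tags[c[0]:c[1]])]


# Toy dictionary; entries and labels are illustrative only.
trie = PhraseTrie()
for phrase in ("Essen", "Berlin", "Frankfurt", "Frankfurt am Main"):
    trie.add(phrase, "LOC")

# "Das Essen in Berlin war gut" ("The food in Berlin was good"):
# "Essen" is a city name in the dictionary but a common noun here.
tokens = ["Das", "Essen", "in", "Berlin", "war", "gut"]
pos_tags = ["ART", "NN", "APPR", "NE", "VAFIN", "ADJD"]  # hand-assigned tags

candidates = find_candidates(tokens, trie)
entities = filter_by_pos(candidates, pos_tags)
```

In this toy run, the lookup stage proposes both "Essen" and "Berlin" as location candidates, and the tag-based filter discards "Essen" because it is tagged as a common noun. Storing phrases token by token in the prefix tree means each lookup costs time proportional to the length of the matched phrase rather than to the size of the dictionary.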
