Paper information - Inflection-Tolerant Ontology-Based Named Entity Recognition for Real-Time Applications

A growing number of applications that users interact with daily have to operate in (near) real time: chatbots, digital companions, and knowledge work support systems, to name just a few. To perform the services desired by the user, these systems have to analyze user activity logs or explicit user input extremely fast. In particular, text content (e.g., in the form of text snippets) needs to be processed in an information extraction task. Given these timing requirements, this has to be accomplished in just a few milliseconds, which limits the number of methods that can be applied. In practice, only very fast methods remain, which in turn deliver worse results than slower but more sophisticated Natural Language Processing (NLP) pipelines. In this paper, we investigate and propose methods for real-time capable Named Entity Recognition (NER). As a first improvement step, we address word variations induced by inflection, as present, for example, in the German language. Our approach is ontology-based and makes use of several language information sources such as Wiktionary. We evaluated it using the German Wikipedia (about 9.4B characters), for which the whole NER process took considerably less than an hour. Since precision and recall are higher than with comparably fast methods, we conclude that the quality gap between high-speed methods and sophisticated NLP pipelines can be narrowed a bit more without losing too much runtime performance.
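To make the idea of inflection-tolerant, dictionary-based recognition more concrete, the following is a minimal sketch and not the method described in the paper. It assumes a tiny hypothetical ontology (`ex:Projekt`, `ex:Bericht`) and a hand-written suffix list (`GERMAN_SUFFIXES`) that stands in for real inflection data such as that obtainable from Wiktionary. All names are illustrative. Surface forms are expanded once at index-build time, so each token costs only a single hash lookup at recognition time.

```python
import re

# Assumption: a hand-picked suffix list standing in for real inflection data
# (the paper draws on sources such as Wiktionary; this list is illustrative only).
GERMAN_SUFFIXES = ("", "s", "es", "e", "en", "n", "er", "em")


def build_index(ontology):
    """Map every lower-cased surface form (label + suffix) to its entity ID."""
    index = {}
    for entity_id, label in ontology.items():
        for suffix in GERMAN_SUFFIXES:
            # setdefault keeps the first entity that claims a surface form
            index.setdefault((label + suffix).lower(), entity_id)
    return index


def recognize(text, index):
    """Return (start, end, entity_id) for every single-token match."""
    matches = []
    for m in re.finditer(r"\w+", text):
        entity_id = index.get(m.group().lower())
        if entity_id is not None:
            matches.append((m.start(), m.end(), entity_id))
    return matches


# Hypothetical two-entry ontology: entity ID -> base label
ontology = {"ex:Projekt": "Projekt", "ex:Bericht": "Bericht"}
index = build_index(ontology)
result = recognize("Die Berichte des Projekts sind fertig.", index)
print(result)
```

In this sketch the inflected forms "Berichte" and "Projekts" are matched although only the base labels are stored in the ontology. Blind suffix expansion over-generates forms that do not exist in German and handles neither umlaut changes nor multi-word labels, so it illustrates the speed-oriented lookup idea rather than a usable recognizer.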
