Effective vector representation for the Korean named-entity recognition

Abstract: Named-entity recognition, part of information extraction, is the task of finding the positions of proper names in a sentence and assigning each to the correct category. Existing studies have approached Korean named-entity recognition with a morphological-level method, which takes the results of morphological analysis as input. While this method has the advantage of using various linguistic clues, it suffers from errors propagated from morphological analysis. In this paper, we propose an effective method for Korean syllable-level named-entity recognition to solve this problem. First, we propose using a syllable bi-gram vector representation for Korean syllable-level named-entity recognition. Second, motivated by the linguistic characteristics of Korean, we propose a novel joint vector representation of syllable bi-grams and the positional information of the Korean eojeol. We evaluate our methods on two Korean named-entity recognition corpora, using bi-directional LSTM-CRFs as the sequence labeler. Experimental results verify that our methods significantly improve the performance of syllable-level named-entity recognition and perform similarly to existing morphological-level named-entity recognition. In addition, further experiments show that, by eliminating the morphological analysis step, our syllable-level named-entity recognition is not only more robust but also faster than traditional morphological-level named-entity recognition.
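
The abstract does not spell out how the syllable bi-grams and eojeol positions are built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that an eojeol is a whitespace-delimited unit, that each syllable is paired with the syllable that follows it in the sentence, and that position is encoded with B/I/E/S tags inside the eojeol. The function name `syllable_bigram_features`, the padding symbol `PAD`, and the tag set are all hypothetical.

```python
from typing import List, Tuple

PAD = "</s>"  # hypothetical padding symbol for the last syllable (assumption)


def syllable_bigram_features(sentence: str) -> List[Tuple[str, str]]:
    """Return one (syllable bi-gram, eojeol position) pair per syllable.

    Assumptions (not taken from the paper):
    - eojeols are separated by whitespace;
    - the text is NFC-normalized, so one Python character is one syllable;
    - positions use B (first), I (inside), E (last), S (single-syllable eojeol).
    """
    syllables: List[str] = []
    positions: List[str] = []
    for eojeol in sentence.split():
        last = len(eojeol) - 1
        for i, syllable in enumerate(eojeol):
            syllables.append(syllable)
            if last == 0:
                positions.append("S")
            elif i == 0:
                positions.append("B")
            elif i == last:
                positions.append("E")
            else:
                positions.append("I")

    features: List[Tuple[str, str]] = []
    for i, syllable in enumerate(syllables):
        # Pair each syllable with its right neighbour; pad at sentence end.
        following = syllables[i + 1] if i + 1 < len(syllables) else PAD
        features.append((syllable + following, positions[i]))
    return features


if __name__ == "__main__":
    for bigram, position in syllable_bigram_features("서울 시청"):
        print(bigram, position)
```

In a model along the lines described, each pair would presumably be mapped to an embedding (for example, by concatenating a bi-gram vector with a position vector) before being passed to the sequence labeler; that step is omitted here because the abstract gives no details about it.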
