Combining Proper Name-Coreference with Conditional Random Fields for Semi-supervised Named Entity Recognition in Vietnamese Text

Named entity recognition (NER) is the task of locating atomic elements in text and classifying them into predefined categories such as the names of persons, organizations and locations. Most existing NER systems are based on supervised learning, which often requires a large amount of labelled training data that is very time-consuming to build. To address this problem, we introduce a semi-supervised learning method for recognizing named entities in Vietnamese text that combines proper name coreference and named-ambiguity heuristics with a powerful sequential learning model, Conditional Random Fields. Our approach builds on the idea of Liao and Veeramachaneni [6] and extends it with proper name coreference. Starting from a model trained on a small, manually annotated data set, the learning model extracts high-confidence named entities and finds low-confidence ones by using proper name coreference rules. The low-confidence named entities are added to the training set so that the model can learn new context features. The F-scores of the system for extracting "Person", "Location" and "Organization" entities are 83.36%, 69.53% and 65.71% when applying the heuristics proposed by Liao and Veeramachaneni. With our proposed heuristics, the corresponding values are 93.13%, 88.15% and 79.35%, respectively. These results show that our method improves the system's accuracy.
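The selection step described above (keep high-confidence entities, then recover low-confidence mentions through proper name coreference) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Mention` record, the 0.9 threshold, and the `corefers` rule (a shorter name whose tokens appear contiguously inside a longer, trusted name) are hypothetical and are not the rules or features used in the paper. The Conditional Random Fields model itself is not shown; its predictions are stood in for by hand-written labels and confidences.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    text: str          # surface form of the candidate entity
    label: str         # label predicted by the sequence model
    confidence: float  # model confidence in [0, 1]


def corefers(short: str, full: str) -> bool:
    """Hypothetical coreference rule (an assumption, not the paper's rule):
    `short` corefers with `full` if its tokens form a contiguous
    sub-sequence of the tokens of the strictly longer name `full`."""
    s, f = short.split(), full.split()
    if not s or len(s) >= len(f):
        return False
    return any(f[i:i + len(s)] == s for i in range(len(f) - len(s) + 1))


def select_new_training_mentions(mentions, high=0.9):
    """Relabel low-confidence mentions using the label of the first
    high-confidence mention they corefer with; others are skipped."""
    trusted = [m for m in mentions if m.confidence >= high]
    selected = []
    for m in mentions:
        if m.confidence >= high:
            continue  # already trusted, nothing to recover
        for t in trusted:
            if corefers(m.text, t.text):
                selected.append(Mention(m.text, t.label, m.confidence))
                break
    return selected


# Stand-in model output: labels and confidences are made up for the example.
mentions = [
    Mention("Nguyen Van An", "PER", 0.97),
    Mention("Van An", "O", 0.41),
    Mention("Ha Noi", "LOC", 0.95),
    Mention("Sai Gon", "LOC", 0.55),
]
selected = select_new_training_mentions(mentions)
```

In this sketch, `selected` holds the mentions that would be added to the training set for the next round; a low-confidence mention with no trusted coreferent is simply left out.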

[1] Cheng Niu, et al. A Bootstrapping Approach to Named Entity Classification Using Successive Learners, 2003, ACL.

[2] Hwee Tou Ng, et al. One Class per Named Entity: Exploiting Unlabeled Text for Named Entity Recognition, 2007, IJCAI.

[3] Richard M. Schwartz, et al. An Algorithm that Learns What's in a Name, 1999, Machine Learning.

[4] Rebecca Hwa, et al. Syntax-based Semi-Supervised Named Entity Tagging, 2005, ACL.

[5] Andrew McCallum, et al. Confidence Estimation for Information Extraction, 2004, NAACL.

[6] Wei Li, et al. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons, 2003, CoNLL.

[7] David Barber, et al. Tagging of name records for genealogical data browsing, 2006, Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '06).

[8] Ralph Grishman, et al. A Maximum Entropy Approach to Named Entity Recognition, 1999.

[9] Mathias Rossignol, et al. An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts, 2010, JEPTALNRECITAL.

[10] Richard Schwartz, et al. An Algorithm that Learns What's in a Name, 1999.

[11] David Yarowsky, et al. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods, 1995, ACL.

[12] Sriharsha Veeramachaneni, et al. A Simple Semi-supervised Algorithm For Named Entity Recognition, 2009, HLT-NAACL 2009.

[13] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, 2001, ICML.

[14] Nigel Collier, et al. Named entity recognition in Vietnamese using classifier voting, 2007, TALIP.

[15] Avrim Blum, et al. The Bottleneck, 2021, Monopsony Capitalism.

[16] Rob Malouf, et al. A Comparison of Algorithms for Maximum Entropy Parameter Estimation, 2002, CoNLL.