From Word Segmentation to POS Tagging for Vietnamese

This paper presents an empirical comparison of two strategies for Vietnamese Part-of-Speech (POS) tagging from unsegmented text: (i) a pipeline strategy, in which the output of a word segmenter is fed as input to a POS tagger, and (ii) a joint strategy, in which a combined segmentation and POS tag is predicted for each syllable. We also compare state-of-the-art (SOTA) feature-based and neural network-based models. On the benchmark Vietnamese treebank (Nguyen et al., 2009), experimental results show that the pipeline strategy yields better POS tagging scores on unsegmented text than the joint strategy, and that the highest accuracy is obtained with a feature-based model.
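To make the joint strategy concrete, the following minimal sketch shows one way a word-segmented, POS-tagged sentence could be converted into per-syllable combined labels and back. The `B-`/`I-` prefix scheme, the underscore convention for joining the syllables of a word, the function names, and the example tags are all assumptions chosen for illustration; the abstract does not specify the label encoding.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]      # (word, POS tag); syllables joined by "_" (assumed convention)
LabeledSyllable = Tuple[str, str]  # (syllable, combined segmentation+POS label)


def to_joint_labels(tagged_words: List[TaggedWord]) -> List[LabeledSyllable]:
    """Expand (word, POS) pairs into per-syllable combined labels.

    Assumed scheme: the first syllable of a word gets "B-<POS>",
    every following syllable of the same word gets "I-<POS>".
    """
    labeled = []
    for word, pos in tagged_words:
        for i, syllable in enumerate(word.split("_")):
            prefix = "B" if i == 0 else "I"
            labeled.append((syllable, f"{prefix}-{pos}"))
    return labeled


def from_joint_labels(labeled: List[LabeledSyllable]) -> List[TaggedWord]:
    """Recover (word, POS) pairs from per-syllable combined labels."""
    words = []  # list of (syllable list, POS tag)
    for syllable, label in labeled:
        prefix, pos = label.split("-", 1)
        if prefix == "B" or not words:
            # A "B" label (or a stray leading "I") starts a new word.
            words.append(([syllable], pos))
        else:
            # An "I" label continues the current word.
            words[-1][0].append(syllable)
    return [("_".join(syllables), pos) for syllables, pos in words]


# Illustrative sentence; the tags "N" and "V" are placeholders.
sentence = [("sinh_viên", "N"), ("học", "V"), ("bài", "N")]
joint = to_joint_labels(sentence)
restored = from_joint_labels(joint)
```

Under this encoding, a single sequence labeler over syllables produces both the word boundaries and the POS tags, whereas the pipeline strategy would first predict boundaries only and then tag the resulting words.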

[1] Wei Xu, et al. Bidirectional LSTM-CRF Models for Sequence Tagging, 2015, ArXiv.

[2] Dinh Dien, et al. Improving Vietnamese POS tagging by integrating a rich feature set and Support Vector Machines, 2008, 2008 IEEE International Conference on Research, Innovation and Vision for the Future in Computing and Communication Technologies.

[3] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[4] Oanh Thi Tran, et al. An Experimental Study on Vietnamese POS Tagging, 2009, 2009 International Conference on Asian Language Processing.

[5] Nguyen Van Toan, et al. Vietnamese Word Segmentation, 2001, NLPRS.

[6] Phuong-Thai Nguyen, et al. Building a Large Syntactically-Annotated Corpus of Vietnamese, 2009, Linguistic Annotation Workshop.

[7] Tuan-Anh Nguyen, et al. NNVLP: A Neural Network-Based Vietnamese Language Processing Toolkit, 2017, IJCNLP.

[8] Li Lin, et al. Probabilistic ensemble learning for Vietnamese word segmentation, 2014, SIGIR.

[9] Stephen Clark, et al. Joint Word Segmentation and POS Tagging Using a Single Perceptron, 2008, ACL.

[10] Kazuhide Yamamoto, et al. Fundamental tools and resource are available for Vietnamese analysis, 2016, 2016 International Conference on Asian Language Processing (IALP).

[11] Dai Quoc Nguyen, et al. A Robust Transformation-Based Learning Approach Using Ripple Down Rules for Part-Of-Speech Tagging, 2014, AI Commun.

[12] P. Compton, et al. A philosophical basis for knowledge acquisition, 1990.

[13] Le Minh Nguyen, et al. A Semi-supervised Learning Method for Vietnamese Part-of-Speech Tagging, 2010, 2010 Second International Conference on Knowledge and Systems Engineering.

[14] Martin Potthast, et al. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2018, CoNLL.

[15] Hinrich Schütze, et al. Efficient Higher-Order CRFs for Morphological Tagging, 2013, EMNLP.

[16] Timothy Dozat, et al. Incorporating Nesterov Momentum into Adam, 2016.

[17] Kunihiko Hiraishi, et al. Dual Decomposition for Vietnamese Part-of-Speech Tagging, 2013, KES.

[18] Guillaume Lample, et al. Neural Architectures for Named Entity Recognition, 2016, NAACL.

[19] Iryna Gurevych, et al. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging, 2017, EMNLP.

[20] Oanh Thi Tran, et al. Improving Vietnamese Word Segmentation and POS Tagging using MEM with Various Kinds of Resources, 2010.

[21] Mathias Rossignol, et al. An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts, 2010, JEPTALNRECITAL.

[22] Son Bao Pham, et al. A Hybrid Approach to Vietnamese Word Segmentation Using Part of Speech Tags, 2009, 2009 International Conference on Knowledge and Systems Engineering.

[23] Minh Le Nguyen, et al. From Treebank Conversion to Automatic Dependency Parsing for Vietnamese, 2014, NLDB.

[24] Dat Quoc Nguyen, et al. A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency Parsing, 2017, CoNLL.

[25] Hô Tuòng Vinh, et al. A Hybrid Approach to Word Segmentation of Vietnamese Texts, 2008, LATA.

[26] Kiem Hoang, et al. POS-Tagger for English-Vietnamese Bilingual Corpus, 2003, ParallelTexts@NAACL-HLT.

[27] Anh-Cuong Le, et al. An Experimental Investigation of Part-Of-Speech Taggers for Vietnamese, 2016, ArXiv.

[28] Nizar Habash, et al. CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 2017, CoNLL.

[29] Hitoshi Isahara, et al. An Error-Driven Word-Character Hybrid Model for Joint Chinese Word Segmentation and POS Tagging, 2009, ACL/IJCNLP.

[30] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality, 2013, NIPS.

[31] Eduard H. Hovy, et al. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, 2016, ACL.

[32] Dinh Dien, et al. A maximum entropy approach for Vietnamese word segmentation, 2006, 2006 International Conference on Research, Innovation and Vision for the Future.

[33] Dai Quoc Nguyen, et al. Ripple Down Rules for Part-of-Speech Tagging, 2011, CICLing.

[34] Dai Quoc Nguyen, et al. A Fast and Accurate Vietnamese Word Segmenter, 2017, LREC.

[35] Jörg Tiedemann, et al. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF, 2017, IJCNLP.

[36] Dai Quoc Nguyen, et al. RDRPOSTagger: A Ripple Down Rules-based Part-Of-Speech Tagger, 2014, EACL.

[37] Trung-Kien Nguyen, et al. Vietnamese Word Segmentation with CRFs and SVMs: An Investigation, 2006, PACLIC.

[38] Anh-Cuong Le, et al. A hybrid approach to Vietnamese word segmentation, 2016, 2016 IEEE RIVF International Conference on Computing & Communication Technologies, Research, Innovation, and Vision for the Future (RIVF).