Universal Word Segmentation: Implementation and Interpretation

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. We also investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively related to the presence of word boundary markers and negatively related to the number of unique non-segmental terms. Based on this analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies (UD) datasets. Our model obtains state-of-the-art accuracy on all the UD languages. Compared to previous work, it performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew.
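To make the idea of segmentation as sequence tagging concrete, the following minimal sketch encodes a segmented sentence as one boundary tag per character and decodes the tags back into words. The four-tag B/I/E/S scheme and the function names words_to_tags and tags_to_words are illustrative assumptions, not the tag set or code used in the paper; a trained tagger would predict the tags instead of deriving them from gold words.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Encode a segmented sentence as one tag per character.

    Assumed (hypothetical) tag set: B = begin, I = inside, E = end,
    S = single-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Decode per-character tags back into words.

    A word is closed whenever an E or S tag is seen; any trailing
    characters without a closing tag are emitted as a final word.
    """
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    gold = ["我", "喜欢", "北京"]
    tags = words_to_tags(gold)
    print(tags)
    print(tags_to_words("".join(gold), tags))
```

Under this encoding, segmenting raw text reduces to predicting one tag per character, which is what allows a single tagging model to be applied across languages with different writing systems.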
