North Sámi morphological segmentation with low-resource semi-supervised sequence labeling

Semi-supervised sequence labeling is an effective way to train a low-resource morphological segmentation system. We show that a feature set augmentation approach, which combines the strengths of generative and discriminative models, is suitable both for graphical models such as conditional random fields (CRFs) and for sequence-to-sequence neural models. We perform a comparative evaluation of four semi-supervised segmentation methods, three existing and one novel. All four systems are language-independent and have open-source implementations. We improve on the previous best results for North Sámi morphological segmentation: morph boundary F1-score improves by a relative 8.6% compared to using the generative Morfessor FlatCat model directly, and by a relative 2.4% compared to a seq2seq baseline. Our neural sequence tagging system reaches almost the same performance as the CRF topline.
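As an illustration of the two ideas named above, the following is a minimal sketch, not the paper's implementation. It assumes that feature set augmentation means adding the generative model's predicted boundaries as extra per-character features for a discriminative labeler, and that morph boundary F1 is micro-averaged over word-internal boundaries; the paper's exact feature set and averaging scheme may differ. The function names and the toy English word are hypothetical.

```python
from typing import Dict, List, Set


def boundary_positions(morphs: List[str]) -> Set[int]:
    """Return character offsets of word-internal morph boundaries."""
    positions, offset = set(), 0
    for morph in morphs[:-1]:  # no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions


def augmented_features(word: str, generative_morphs: List[str]) -> List[Dict[str, str]]:
    """Build one feature dict per character.

    Assumption: the generative model's segmentation (here passed in as
    `generative_morphs`) contributes a single boundary flag per character,
    alongside simple character-context features.
    """
    assert "".join(generative_morphs) == word
    gen_bounds = boundary_positions(generative_morphs)
    feats = []
    for i, ch in enumerate(word):
        feats.append({
            "char": ch,
            "prev": word[i - 1] if i > 0 else "<w>",
            "next": word[i + 1] if i + 1 < len(word) else "</w>",
            # True if the generative model placed a boundary before this character
            "gen_boundary_before": str(i in gen_bounds),
        })
    return feats


def boundary_f1(predicted: List[List[str]], gold: List[List[str]]) -> float:
    """Micro-averaged F1 over morph boundaries (a simplifying assumption)."""
    tp = n_pred = n_gold = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundary_positions(p), boundary_positions(g)
        tp += len(pb & gb)
        n_pred += len(pb)
        n_gold += len(gb)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example (English, purely illustrative): the generative model
# undersegments, finding one of the two gold boundaries.
gold = [["un", "lock", "ed"]]
generative = [["un", "locked"]]
features = augmented_features("unlocked", generative[0])
score = boundary_f1(generative, gold)
```

The per-character feature dicts could then be passed to any sequence labeler that predicts a boundary label per character; the evaluation function compares boundary positions rather than whole morphs, so a partially correct segmentation still receives partial credit.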
