Semi-Supervised Induction of POS-Tag Lexicons with Tree Models

We address POS tagging of morphologically rich languages in a setting where only a small amount of labeled training data is available. We show that a bigram HMM tagger benefits from re-training on a larger untagged text using Baum-Welch estimation. Most importantly, this estimation can be significantly improved by pre-guessing tags for out-of-vocabulary (OOV) words based on morphological criteria. We consider two models for this task: a character-based recurrent neural network (RNN), which guesses the tag from the string form of the word, and a recently proposed graph-based model of morphological transformations. In the latter, the unknown POS tags can be modeled as latent variables in a way very similar to Hidden Markov Tree models, and an analogue of the Forward-Backward algorithm can be formulated, which enables us to compute expected values over unknown taggings. We evaluate both the quality of the induced tag lexicon and its impact on the HMM’s tagging accuracy. In both tasks, the graph-based morphology model performs significantly better than the RNN predictor. This confirms the intuition that morphologically related words provide useful information about an unknown word’s POS tag.
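To illustrate the kind of computation that a Forward-Backward analogue on trees performs, the following is a minimal sketch of an upward-downward pass over a generic hidden Markov tree. It is not the model from this work: the function name `tree_posteriors`, the single shared transition matrix `trans`, the root prior `prior`, and the toy words are all assumptions made for illustration. Each node stands for a word, each edge for a morphological derivation, and the sketch returns the posterior tag distribution of every node given the tags that are already known.

```python
from collections import defaultdict


def tree_posteriors(parent, prior, trans, observed):
    """Posterior tag marginals in a hidden Markov tree (illustrative sketch).

    parent:   dict node -> parent node (None for the single root)
    prior:    list, prior[t] = P(root tag = t)
    trans:    matrix, trans[s][t] = P(child tag = t | parent tag = s)
              (assumed shared by all edges, a simplification)
    observed: dict node -> known tag index (nodes from the labeled lexicon)
    """
    num_tags = len(prior)
    children = defaultdict(list)
    root = None
    for node, par in parent.items():
        if par is None:
            root = node
        else:
            children[par].append(node)

    # Breadth-first order: every parent precedes its children.
    order, i = [root], 0
    while i < len(order):
        order.extend(children[order[i]])
        i += 1

    def evidence(node, tag):
        # 1 if the tag is compatible with what is known about the node.
        return 1.0 if observed.get(node, tag) == tag else 0.0

    # Upward pass: up[n][t] = P(evidence in subtree of n | tag(n) = t).
    # msg[n][s] is the same quantity marginalized to the parent's tag s.
    up, msg = {}, {}
    for node in reversed(order):
        up[node] = [evidence(node, t) for t in range(num_tags)]
        for child in children[node]:
            for t in range(num_tags):
                up[node][t] *= msg[child][t]
        msg[node] = [
            sum(trans[s][t] * up[node][t] for t in range(num_tags))
            for s in range(num_tags)
        ]

    # Downward pass: down[n][t] = P(tag(n) = t, evidence outside subtree of n).
    down, post = {root: list(prior)}, {}
    for node in order:
        joint = [down[node][t] * up[node][t] for t in range(num_tags)]
        z = sum(joint)
        post[node] = [x / z for x in joint]
        for child in children[node]:
            # Parent belief using everything except the child's own subtree.
            excl = [down[node][s] * evidence(node, s) for s in range(num_tags)]
            for sib in children[node]:
                if sib != child:
                    excl = [excl[s] * msg[sib][s] for s in range(num_tags)]
            down[child] = [
                sum(excl[s] * trans[s][t] for s in range(num_tags))
                for t in range(num_tags)
            ]
    return post


# Toy example with two tags (0 = NOUN, 1 = VERB); all numbers are made up.
parent = {"walk": None, "walks": "walk", "walked": "walk"}
post = tree_posteriors(
    parent,
    prior=[0.5, 0.5],
    trans=[[0.9, 0.1], [0.2, 0.8]],
    observed={"walks": 0},
)
print(post["walked"])
```

In this toy tree, knowing the tag of one derived word shifts the posterior of its unlabeled sibling through their shared parent, which is the intuition behind using morphologically related words to guess the tag of an unknown word. Expected tag counts for lexicon induction could be read off such posteriors, although the actual model may parameterize the edges differently.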
