Training Hybrid Language Models by Marginalizing over Segmentations

In this paper, we study the problem of hybrid language modeling, that is, using models which can predict both characters and larger units such as character n-grams or words. With such models, multiple segmentations usually exist for a given string, for example one using words and one using characters only. The probability of a string is therefore the sum of the probabilities of all its possible segmentations. Here, we show how to marginalize over segmentations efficiently, in order to compute the true probability of a sequence. We apply our technique to three datasets, comprising seven languages, and show improvements over a strong character-level language model.
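The marginalization described above can be computed with a forward dynamic program: the total probability of a prefix is the sum, over every possible last segment, of the probability of the shorter prefix times the probability of that segment. A minimal sketch, where `seg_prob` stands in for a hypothetical model that scores a segment given its preceding context (not the paper's actual model):

```python
def marginal_prob(s, seg_prob, max_seg_len=4):
    """Sum the probability of string s over all segmentations.

    alpha[t] holds the total probability of s[:t], marginalized over
    every way of cutting s[:t] into segments of length <= max_seg_len.
    """
    n = len(s)
    alpha = [1.0] + [0.0] * n  # empty prefix has probability 1
    for t in range(1, n + 1):
        for j in range(max(0, t - max_seg_len), t):
            # extend each segmentation of s[:j] by the segment s[j:t]
            alpha[t] += alpha[j] * seg_prob(s[:j], s[j:t])
    return alpha[n]


# Toy scorer (an illustrative assumption, not a trained model):
# single characters get probability 0.1, two-character segments 0.01.
def toy_seg_prob(context, segment):
    return {1: 0.1, 2: 0.01}.get(len(segment), 0.0)


# "ab" can be segmented as "a"+"b" (0.1 * 0.1) or as "ab" (0.01),
# so the marginal probability is their sum, 0.02.
total = marginal_prob("ab", toy_seg_prob)
```

A real implementation would work in log space (log-sum-exp) for numerical stability, but the recursion is the same.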
