Scaling Hidden Markov Language Models

The hidden Markov model (HMM) is a fundamental tool for sequence modeling that cleanly separates the hidden state from the emission structure. However, this separation makes it difficult to fit HMMs to large datasets in modern NLP, and they have fallen out of use due to very poor performance compared to fully observed models. This work revisits the challenge of scaling HMMs to language modeling datasets, taking ideas from recent approaches to neural modeling. We propose methods for scaling HMMs to massive state spaces while maintaining efficient exact inference, a compact parameterization, and effective regularization. Experiments show that this approach leads to models that are more accurate than previous HMM and n-gram-based methods, making progress towards the performance of state-of-the-art neural models.
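To make the scaling challenge concrete, below is a minimal sketch of exact HMM inference (the standard forward algorithm) in log space; the per-step cost grows as O(K^2) in the number of states K, which is why massive state spaces require a more structured approach than the generic recursion shown here. This is not the paper's specific parameterization, and all names (`log_pi`, `log_A`, `log_B`, `hmm_log_likelihood`) are illustrative assumptions.

```python
# Generic forward algorithm for an HMM in log space (illustrative sketch only).
import numpy as np
from scipy.special import logsumexp

def hmm_log_likelihood(obs, log_pi, log_A, log_B):
    """obs: token ids, shape [T]; log_pi: start distribution, shape [K];
    log_A: transitions, shape [K, K]; log_B: emissions, shape [K, V].
    Returns log p(obs) under the HMM."""
    alpha = log_pi + log_B[:, obs[0]]  # [K]
    for t in range(1, len(obs)):
        # alpha_new[j] = logsumexp_i(alpha[i] + log_A[i, j]) + log_B[j, obs[t]]
        # The [K, K] sum below is the O(K^2) bottleneck at every time step.
        alpha = logsumexp(alpha[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return logsumexp(alpha)

# Toy usage: K = 4 states, V = 6 token vocabulary, random normalized parameters.
rng = np.random.default_rng(0)
K, V = 4, 6
log_pi = np.log(rng.dirichlet(np.ones(K)))
log_A = np.log(rng.dirichlet(np.ones(K), size=K))
log_B = np.log(rng.dirichlet(np.ones(V), size=K))
print(hmm_log_likelihood(np.array([0, 3, 2, 5]), log_pi, log_A, log_B))
```

Keeping this recursion exact while pushing K into the tens of thousands is the core tension the paper addresses; a dense [K, K] transition matrix alone becomes both a memory and a compute bottleneck, motivating compact parameterizations of the transition and emission distributions.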
