BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

We show that BERT (Devlin et al., 2018) is a Markov random field language model. This formulation gives rise to a natural procedure for sampling sentences from BERT. We generate from BERT and find that it can produce high-quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.
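
The sampling procedure referred to above amounts to Gibbs-style sampling from BERT's masked-language-model conditionals: repeatedly mask a position and resample it from the model's predicted distribution. Below is a minimal sketch of such a sampler, assuming the HuggingFace `transformers` library and the `bert-base-cased` checkpoint; the sequence length, number of sweeps, and temperature are illustrative defaults, not settings taken from the paper.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load a pretrained BERT masked language model (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()


def gibbs_sample_sentence(seq_len=10, n_iters=200, temperature=1.0):
    """Sample a sentence by repeatedly masking one position and resampling it
    from BERT's masked-LM conditional (a random-scan Gibbs-style sweep)."""
    # Start from an all-[MASK] sequence wrapped in [CLS] ... [SEP].
    body = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
    ids = torch.cat(
        [
            torch.tensor([[tokenizer.cls_token_id]]),
            body,
            torch.tensor([[tokenizer.sep_token_id]]),
        ],
        dim=1,
    )

    with torch.no_grad():
        for _ in range(n_iters):
            # Pick a content position (excluding [CLS]/[SEP]) and re-mask it.
            pos = torch.randint(1, seq_len + 1, (1,)).item()
            ids[0, pos] = tokenizer.mask_token_id
            # Resample that token from the model's conditional distribution.
            logits = model(ids).logits[0, pos] / temperature
            probs = torch.softmax(logits, dim=-1)
            ids[0, pos] = torch.multinomial(probs, 1).item()

    return tokenizer.decode(ids[0, 1 : seq_len + 1])


print(gibbs_sample_sentence())
```

This is a sketch under the stated assumptions rather than the authors' exact procedure; in particular, the initialization, scan order, and number of iterations can all be varied, and the paper discusses such choices in more detail.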

[1] Lantao Yu et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, 2016, AAAI.

[2] Yann Dauphin et al. Hierarchical Neural Story Generation, 2018, ACL.

[3] Pascal Vincent et al. Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines, 2010, AISTATS.

[4] Dan Klein et al. Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2018, ACL.

[5] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[6] Yoav Goldberg et al. Assessing BERT's Syntactic Abilities, 2019, arXiv.

[7] Guillaume Lample et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[8] Pascal Vincent et al. A Connection Between Score Matching and Denoising Autoencoders, 2011, Neural Computation.

[9] J. Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields, 1977.

[10] Alec Radford et al. Improving Language Understanding by Generative Pre-Training, 2018.

[11] Ruslan Salakhutdinov et al. Learning in Markov Random Fields using Tempered Transitions, 2009, NIPS.

[12] Richard Socher et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[13] Yann Dauphin et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[14] Pascal Vincent et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010, J. Mach. Learn. Res.

[15] Sanja Fidler et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).

[16] Tapani Raiko et al. Parallel tempering is efficient for learning restricted Boltzmann machines, 2010, International Joint Conference on Neural Networks (IJCNN).

[17] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[18] Kyunghyun Cho et al. Passage Re-ranking with BERT, 2019, arXiv.

[19] Lei Zheng et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[20] Radford M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods, 2011.

[21] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[22] Alexander M. Rush et al. A Fast Variational Approach for Learning Markov Random Field Language Models, 2015, ICML.

[23] John Scott Bridle. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, 1989, NATO Neurocomputing.

[24] Wang et al. Replica Monte Carlo simulation of spin glasses, 1986, Physical Review Letters.