BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

We show that BERT (Devlin et al., 2018) is a Markov random field language model. This formulation gives rise to a natural procedure for sampling sentences from BERT. We generate from BERT and find that it can produce high-quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.
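
The sampling procedure referred to above amounts to Gibbs-style sampling from BERT's masked-language-model conditionals: repeatedly mask a position and resample it from the model's predicted distribution. Below is a minimal sketch of such a sampler, assuming the HuggingFace `transformers` library and the `bert-base-cased` checkpoint; the sequence length, number of sweeps, and temperature are illustrative defaults, not settings taken from the paper.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load a pretrained BERT masked language model (checkpoint name is an assumption).
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()


def gibbs_sample_sentence(seq_len=10, n_iters=200, temperature=1.0):
    """Sample a sentence by repeatedly masking one position and resampling it
    from BERT's masked-LM conditional (a random-scan Gibbs-style sweep)."""
    # Start from an all-[MASK] sequence wrapped in [CLS] ... [SEP].
    body = torch.full((1, seq_len), tokenizer.mask_token_id, dtype=torch.long)
    ids = torch.cat(
        [
            torch.tensor([[tokenizer.cls_token_id]]),
            body,
            torch.tensor([[tokenizer.sep_token_id]]),
        ],
        dim=1,
    )

    with torch.no_grad():
        for _ in range(n_iters):
            # Pick a content position (excluding [CLS]/[SEP]) and re-mask it.
            pos = torch.randint(1, seq_len + 1, (1,)).item()
            ids[0, pos] = tokenizer.mask_token_id
            # Resample that token from the model's conditional distribution.
            logits = model(ids).logits[0, pos] / temperature
            probs = torch.softmax(logits, dim=-1)
            ids[0, pos] = torch.multinomial(probs, 1).item()

    return tokenizer.decode(ids[0, 1 : seq_len + 1])


print(gibbs_sample_sentence())
```

This is a sketch under the stated assumptions rather than the authors' exact procedure; in particular, the initialization, scan order, and number of iterations can all be varied, and the paper discusses such choices in more detail.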

[1] Lantao Yu et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, 2016, AAAI.

[2] Yann Dauphin et al. Hierarchical Neural Story Generation, 2018, ACL.

[3] Pascal Vincent et al. Tempered Markov Chain Monte Carlo for training of Restricted Boltzmann Machines, 2010, AISTATS.

[4] Dan Klein et al. Multilingual Constituency Parsing with Self-Attention and Pre-Training, 2018, ACL.

[5] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[6] Yoav Goldberg et al. Assessing BERT's Syntactic Abilities, 2019, arXiv.

[7] Guillaume Lample et al. Cross-lingual Language Model Pretraining, 2019, NeurIPS.

[8] Pascal Vincent et al. A Connection Between Score Matching and Denoising Autoencoders, 2011, Neural Computation.

[9] J. Besag. Efficiency of pseudolikelihood estimation for simple Gaussian fields, 1977.

[10] Alec Radford et al. Improving Language Understanding by Generative Pre-Training, 2018.

[11] Ruslan Salakhutdinov et al. Learning in Markov Random Fields using Tempered Transitions, 2009, NIPS.

[12] Richard Socher et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[13] Yann Dauphin et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[14] Pascal Vincent et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010, J. Mach. Learn. Res.

[15] Sanja Fidler et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, IEEE International Conference on Computer Vision (ICCV).

[16] Tapani Raiko et al. Parallel tempering is efficient for learning restricted Boltzmann machines, 2010, International Joint Conference on Neural Networks (IJCNN).

[17] Salim Roukos et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[18] Kyunghyun Cho et al. Passage Re-ranking with BERT, 2019, arXiv.

[19] Lei Zheng et al. Texygen: A Benchmarking Platform for Text Generation Models, 2018, SIGIR.

[20] Radford M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods, 2011.

[21] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space, 2013, ICLR.

[22] Alexander M. Rush et al. A Fast Variational Approach for Learning Markov Random Field Language Models, 2015, ICML.

[23] John Scott Bridle. Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition, 1989, NATO Neurocomputing.

[24] Wang et al. Replica Monte Carlo simulation of spin glasses, 1986, Physical Review Letters.