Do Massively Pretrained Language Models Make Better Storytellers?

Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate text of higher quality, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to the ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.
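The abstract's point about likelihood-maximizing decoding can be illustrated with a toy example. The sketch below uses a hypothetical hand-built next-token distribution (a stand-in for a real language model like those studied here, not the paper's actual models or code): greedy argmax decoding gets trapped repeating the single most probable token, while sampling from the full distribution yields more varied output.

```python
import random

# Tiny vocabulary and a hand-crafted next-token model: given the previous
# token, return a probability distribution over the vocabulary. This is a
# hypothetical stand-in for a trained language model.
VOCAB = ["the", "cat", "sat", "mat", "<eos>"]

def next_token_probs(prev):
    # "the" is always the single most likely continuation of "the",
    # so likelihood maximization loops on it forever.
    table = {
        "<bos>": [0.60, 0.20, 0.10, 0.05, 0.05],
        "the":   [0.40, 0.30, 0.20, 0.05, 0.05],
        "cat":   [0.10, 0.05, 0.60, 0.05, 0.20],
        "sat":   [0.30, 0.05, 0.05, 0.50, 0.10],
        "mat":   [0.10, 0.05, 0.05, 0.05, 0.75],
    }
    return dict(zip(VOCAB, table[prev]))

def greedy_decode(max_len=8):
    # Likelihood-maximizing decoding: always pick the argmax token.
    out, prev = [], "<bos>"
    for _ in range(max_len):
        probs = next_token_probs(prev)
        tok = max(probs, key=probs.get)
        if tok == "<eos>":
            break
        out.append(tok)
        prev = tok
    return out  # degenerates into "the the the ..."

def sample_decode(max_len=8, rng=None):
    # Ancestral sampling: draw each token from the full distribution.
    rng = rng or random.Random(0)
    out, prev = [], "<bos>"
    for _ in range(max_len):
        probs = next_token_probs(prev)
        tok = rng.choices(list(probs), weights=list(probs.values()))[0]
        if tok == "<eos>":
            break
        out.append(tok)
        prev = tok
    return out
```

On this toy model, `greedy_decode()` repeats "the" until the length cap, while `sample_decode()` produces a varied sequence that can terminate naturally at `<eos>` — the same qualitative gap the paper measures between likelihood-maximizing decoding and sampling-based alternatives.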

[1]  R. P. Fishburne, et al.  Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel, 1975.

[2]  Mirella Lapata, et al.  Modeling Local Coherence: An Entity-Based Approach, 2005, ACL.

[3]  W. Nagy, et al.  Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre?, 2009.

[4]  Philip M. McCarthy, et al.  Linguistic Features of Writing Quality, 2010.

[5]  J. Pennebaker, et al.  Language style matching in writing: synchrony in essays, correspondence, and poetry, 2010, Journal of Personality and Social Psychology.

[6]  Amy Beth Warriner, et al.  Concreteness ratings for 40 thousand generally known English word lemmas, 2014, Behavior Research Methods.

[7]  Jeffrey Pennington, et al.  GloVe: Global Vectors for Word Representation, 2014, EMNLP.

[8]  Hang Li, et al.  Neural Responding Machine for Short-Text Conversation, 2015, ACL.

[9]  Joelle Pineau, et al.  How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation, 2016, EMNLP.

[10]  Joelle Pineau, et al.  Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models, 2015, AAAI.

[11]  Daniel Jurafsky, et al.  Mutual Information and Diverse Decoding Improve Neural Machine Translation, 2016, ArXiv.

[12]  Jianfeng Gao, et al.  A Diversity-Promoting Objective Function for Neural Conversation Models, 2015, NAACL.

[13]  Ashwin K. Vijayakumar, et al.  Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models, 2016, ArXiv.

[14]  Verena Rieser, et al.  Why We Need New Evaluation Metrics for NLG, 2017, EMNLP.

[15]  Sanjeev Arora, et al.  A Simple but Tough-to-Beat Baseline for Sentence Embeddings, 2017, ICLR.

[16]  Yann Dauphin, et al.  Convolutional Sequence to Sequence Learning, 2017, ICML.

[17]  R. Swanson, et al.  Evaluating Story Generation Systems Using Automated Linguistic Analyses, 2017.

[18]  Alan Ritter, et al.  Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints, 2018, EMNLP.

[19]  Yann Dauphin, et al.  Hierarchical Neural Story Generation, 2018, ACL.

[20]  Adam Coates, et al.  Cold Fusion: Training Seq2Seq Models Together with Language Models, 2017, INTERSPEECH.

[21]  Luke S. Zettlemoyer, et al.  Deep Contextualized Word Representations, 2018, NAACL.

[22]  M. de Rijke, et al.  Why are Sequence-to-Sequence Models So Dull? Understanding the Low-Diversity Problem of Chatbots, 2018, SCAI@EMNLP.

[23]  Xueqi Cheng, et al.  Learning to Control the Specificity in Neural Response Generation, 2018, ACL.

[24]  Jason Weston, et al.  Importance of a Search Strategy in Neural Dialogue Modelling, 2018, ArXiv.

[25]  Alec Radford, et al.  Improving Language Understanding by Generative Pre-Training, 2018.

[26]  Kyunghyun Cho, et al.  Importance of Search and Evaluation Strategies in Neural Dialogue Modeling, 2018, INLG.

[27]  Percy Liang, et al.  Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.

[28]  Garrison W. Cottrell, et al.  Improving Neural Story Generation by Targeted Common Sense Grounding, 2019, EMNLP.

[29]  Jason Weston, et al.  What makes a good conversation? How controllable attributes affect human judgments, 2019, NAACL.

[30]  Alexander M. Rush, et al.  GLTR: Statistical Detection and Visualization of Generated Text, 2019, ACL.

[31]  Ilya Sutskever, et al.  Language Models are Unsupervised Multitask Learners, 2019.

[32]  Ming-Wei Chang, et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[33]  Alexander M. Rush, et al.  Encoder-Agnostic Adaptation for Conditional Language Generation, 2019, ArXiv.

[34]  Myle Ott, et al.  fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.

[35]  Joelle Pineau, et al.  Language GANs Falling Short, 2018, ICLR.

[36]  Yejin Choi, et al.  The Curious Case of Neural Text Degeneration, 2019, ICLR.