Model Criticism for Long-Form Text Generation

Language models have demonstrated the ability to generate highly fluent text; however, it remains unclear whether their output retains coherent high-level structure (e.g., story progression). Here, we propose to apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of generated text. Model criticism compares the distributions of real and generated data in a latent space obtained under an assumed generative process; different generative processes identify specific failure modes of the underlying model. We perform experiments on three representative aspects of high-level discourse—coherence, coreference, and topicality—and find that transformer-based language models are able to capture topical structure but have a harder time maintaining structural coherence or modeling coreference.
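The core idea—mapping real and generated documents into a shared latent space and testing whether the two latent distributions differ—can be sketched with a classical two-sample statistic. The sketch below is illustrative only: the encoder is replaced by synthetic latent vectors, and Hotelling's two-sample T² is just one possible discrepancy measure; the paper's actual checks depend on the assumed generative process.

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling T^2 statistic between latent samples
    x of shape (n1, d) and y of shape (n2, d): a classical test of
    whether two sets of latent vectors share the same mean."""
    n1, n2 = len(x), len(y)
    mx, my = x.mean(axis=0), y.mean(axis=0)
    # pooled sample covariance of the two groups
    s = ((n1 - 1) * np.cov(x, rowvar=False) +
         (n2 - 1) * np.cov(y, rowvar=False)) / (n1 + n2 - 2)
    diff = mx - my
    return (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(s, diff)

rng = np.random.default_rng(0)
# Stand-ins for encoded documents; in practice these would come from
# a latent-variable model fit to real and model-generated text.
z_real = rng.normal(0.0, 1.0, size=(200, 8))
z_gen = rng.normal(0.5, 1.0, size=(200, 8))  # shifted: a structural mismatch
print(hotelling_t2(z_real, z_gen))  # large value -> latent distributions differ
```

A large statistic relative to its null distribution signals that the generated text's latent structure diverges from the real data's, which is the kind of discrepancy latent-space model criticism is designed to surface.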
