Natural Language Statistical Features of LSTM-Generated Texts

Long short-term memory (LSTM) networks have recently shown remarkable performance in several tasks involving natural language generation, such as image captioning and poetry composition. Yet only a few works have analyzed text generated by LSTMs in order to quantitatively evaluate to what extent such artificial texts resemble those written by humans. We compared the statistical structure of LSTM-generated language to that of written natural language and to texts produced by Markov models of various orders. In particular, we characterized the statistical structure of language by assessing word-frequency statistics, long-range correlations, and entropy measures. Our main finding is that while both LSTM- and Markov-generated texts can exhibit features similar to those of real texts in their word-frequency statistics and entropy measures, only LSTM-generated texts reproduce long-range correlations at scales comparable to those found in natural language. Moreover, for LSTM networks, a temperature-like parameter controlling the generation process shows an optimal value (for which the produced texts are closest to real language) that is consistent across the different statistical features investigated.
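The temperature parameter mentioned above rescales the network's output distribution before each token is sampled: low temperatures sharpen the distribution toward the most probable words, while high temperatures flatten it and increase diversity. The following minimal sketch illustrates this mechanism under the assumption of a softmax output layer; the NumPy-based implementation and the function name sample_with_temperature are illustrative, not code from the paper.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample one token index from a categorical distribution defined by
    `logits`, after rescaling the logits by `temperature`.

    Lower temperatures make sampling more conservative (closer to greedy
    decoding); higher temperatures make it closer to uniform sampling.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    # Numerically stable softmax.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Example: the same logits sampled at two temperatures (values are made up).
logits = [2.0, 1.0, 0.1]
print(sample_with_temperature(logits, temperature=0.5))  # usually index 0
print(sample_with_temperature(logits, temperature=2.0))  # more varied output
```

Sweeping this temperature and recomputing the word-frequency, correlation, and entropy statistics on the generated text is what reveals the optimal value reported in the abstract.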
