The Gutenberg English Poetry Corpus: Exemplary Quantitative Narrative Analyses

This paper describes a corpus of about 3000 English literary texts, totalling about 250 million words, extracted from Project Gutenberg. The texts span a range of fiction and non-fiction genres and were written by more than 130 authors (e.g., Darwin, Dickens, Shakespeare). Quantitative Narrative Analysis (QNA) is used to explore a cleaned subcorpus, the Gutenberg English Poetry Corpus (GEPC), which comprises over 100 poetic texts with around 2 million words from about 50 authors (e.g., Keats, Joyce, Wordsworth). Exemplary QNA studies illustrate author similarities based on latent semantic analysis, significant topics for each author, and various text-analytic metrics (e.g., lexical diversity and sentiment) for George Eliot’s poem ‘How Lisa Loved the King’ and James Joyce’s ‘Chamber Music’. The GEPC is particularly suited for research in Digital Humanities, Natural Language Processing, or Neurocognitive Poetics, e.g., as a training and test corpus, or for stimulus development and control.
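As a rough illustration of one text-analytic metric mentioned above, lexical diversity, the following minimal sketch computes a type-token ratio (distinct words divided by total words). The abstract does not say which diversity measure the paper uses, so the type-token ratio, the regex tokenizer, and the function names `tokenize` and `type_token_ratio` are all assumptions made for illustration, not the authors' method.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word tokens.

    Assumption: a word is a run of ASCII letters or apostrophes;
    a real study would likely use a dedicated NLP tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return distinct words / total words, or 0.0 if the text has no words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical sample line, not taken from the corpus.
sample = "the rose is a rose is a rose"
print(type_token_ratio(sample))  # 4 distinct words / 8 words = 0.5
```

The raw type-token ratio tends to fall as texts get longer, so comparing poems of very different lengths would usually call for a length-corrected variant.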
