Computationally discriminating literary from non-literary texts

Three computational linguistic methods for discriminating literary from non-literary texts are presented. In the first study, hierarchical clustering of results obtained from Latent Semantic Analysis separated literary from non-literary texts. The second study used the frequencies of bigrams shared across texts and classified literary versus non-literary texts with 100% accuracy. The third study used unigrams and classified the texts with 94% accuracy. The final two studies, which used a larger sample of texts, showed that the high classification performance cannot be attributed to specific texts. These findings provide evidence that literature can be distinguished from non-literature with high accuracy using relatively simple computational linguistic techniques.
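The n-gram approach mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not specify the similarity measure or the classifier, so cosine similarity over raw n-gram counts and a nearest-neighbour decision are assumptions made here for illustration, as are the function names and the toy sentences.

```python
from collections import Counter
import math


def ngram_counts(text, n):
    """Count word n-grams (n=1 for unigrams, n=2 for bigrams) in a text."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(text, labelled, n=2):
    """Return the label of the labelled text sharing the most n-gram mass.

    `labelled` is a list of (text, label) pairs; this nearest-neighbour rule
    is a hypothetical stand-in for whatever classifier the study used.
    """
    target = ngram_counts(text, n)
    best = max(labelled, key=lambda item: cosine(target, ngram_counts(item[0], n)))
    return best[1]


# Toy data invented for this sketch, not taken from the study's corpus.
labelled = [
    ("the old man walked to the sea", "literary"),
    ("the company reported quarterly earnings growth", "non-literary"),
]
label = classify("the old man looked at the sea", labelled, n=2)
```

Setting n=1 switches the same sketch from bigrams to unigrams. With real corpora, the counts would normally be normalised for text length before comparison.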
