Using Shakespeare's Sotto Voce to Determine True Identity From Text

Little is known of the private life of William Shakespeare, but he is famous for his collection of plays and poems, even though many of the works attributed to him were published anonymously. Determining the identity of Shakespeare has fascinated scholars for 400 years, and four significant figures in English literary history have been suggested as likely alternatives to Shakespeare for some disputed works: Bacon, de Vere, Stanley, and Marlowe. A myriad of computational and statistical tools and techniques have been used to determine the true authorship of his works. Many of these techniques rely on basic statistical correlations, word counts, collocated word groups, or keyword density, but no single method has been agreed upon. We suggest that an alternative technique, one that uses word semantics to draw on personality, can provide an accurate profile of a person. To test this claim, we analyse the works of Shakespeare, Christopher Marlowe, and Elizabeth Cary. We use Word Accumulation Curves, Hierarchical Clustering overlays, Principal Component Analysis, and Linear Discriminant Analysis in combination with RPAS, a multi-faceted text analysis approach that draws on a writer's personality, or self, to identify subtle characteristics within a person's writing style. We find that RPAS can separate the works known to be authored by Shakespeare from those of Marlowe and Cary. It also separates their contested works, that is, works suspected of having been written by others. While few authorship identification techniques identify self from the way a person writes, we demonstrate that these stylistic characteristics are as applicable to texts written 400 years ago as to those written today, and that they have the potential to be used within cyberspace for law enforcement purposes.
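As a rough illustration of one of the techniques named above, a word accumulation curve can be sketched as the number of distinct words seen after each successive token of a text. The abstract does not specify how the curves were computed or how RPAS works, so the tokenisation rule, the function name `word_accumulation_curve`, and the sample sentence below are assumptions made for illustration only.

```python
import re


def word_accumulation_curve(text):
    """Return (tokens_read, distinct_words_seen) pairs for a text.

    Assumption: tokens are lowercase runs of letters and apostrophes.
    The study itself may tokenise differently.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    seen = set()
    curve = []
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        # Record vocabulary size after reading this many tokens.
        curve.append((position, len(seen)))
    return curve


# Hypothetical sample text, not drawn from the study's corpus.
sample = "To be, or not to be, that is the question"
curve = word_accumulation_curve(sample)
print(curve[-1])  # (total tokens, total distinct words)
```

Curves that rise at different rates suggest different rates of vocabulary growth, which is one way such curves might be used to compare texts by different authors.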
