Literary Detective Work on the Computer

Computational linguistics can be used to uncover mysteries in text that are not always obvious on visual inspection. For example, computer analysis of writing style can indicate who the true author of a text might be in cases of disputed authorship or suspected plagiarism. The theoretical background to authorship attribution is presented in a step-by-step manner, and comprehensive reviews of the field are given for two specialist areas: the writings of William Shakespeare and his contemporaries, and the various writing styles seen in religious texts. The final chapter looks at the progress computers have made in the decipherment of lost languages. This book is written for students and researchers of general linguistics, computational and corpus linguistics, and computer forensics. It aims to inspire future researchers to study these topics for themselves, and gives sufficient detail of the methods and resources to get them started.
