Identifying idiolect in forensic authorship attribution: an n-gram textbite approach

Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist “approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [...], their own idiolect” (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two to six word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ‘n-gram textbites’, small textual segments that characterise that author’s writing, providing DNA-like chunks of identifying material? Drawing on a corpus of 63,000 emails and 2.5 million words written by 176 employees of the former American energy corporation Enron, a case study approach is adopted, first showing through stylistic analysis that one Enron employee repeatedly produces the same stylistic patterns of politely encoded directives in a way that may be considered habitual. Then a statistical experiment with the same case study author finds that word n-grams can assign anonymised email samples to him with success rates as high as 100%.
This paper argues that, if sufficiently distinctive, these textbites are able to identify authors by reducing a mass of data to key segments that move us closer to the elusive concept of idiolect.
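
To make the notion of a word n-gram concrete, the following is a minimal sketch in Python of how two to six word chunks might be extracted from a set of texts and filtered to those that recur across them. It is an illustration only and not the procedure used in the paper: the function names `word_ngrams` and `recurring_ngrams`, the whitespace tokenisation, the `min_texts` threshold, and the three sample emails are all assumptions introduced here.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams (as tuples) of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def recurring_ngrams(texts, min_n=2, max_n=6, min_texts=2):
    """Map each n-gram to the number of texts containing it,
    keeping only n-grams found in at least `min_texts` texts."""
    doc_freq = Counter()
    for text in texts:
        seen = set()  # count each n-gram at most once per text
        for n in range(min_n, max_n + 1):
            seen.update(word_ngrams(text, n))
        doc_freq.update(seen)
    return {gram: count for gram, count in doc_freq.items() if count >= min_texts}


# Hypothetical sample emails, invented for illustration (not Enron data).
emails = [
    "please let me know if you have any questions",
    "please let me know what you think",
    "call me if you have any questions",
]
shared = recurring_ngrams(emails)
```

Here `shared` holds candidate chunks such as `("please", "let", "me", "know")` that appear in more than one of the sample emails. Whether such recurring chunks are distinctive of one author rather than common to many would still need to be tested against other writers' texts.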

[1]  Tim Grant Approaching questions in forensic authorship analysis , 2008 .

[2]  M. Economidou-Kogetsidis Please answer me as soon as possible: Pragmatic failure in non-native speakers e-mail requests to , 2011 .

[3]  Ray Bull,et al.  Incorporating Context in Linking Crimes: An Exploratory Study of Situational Similarity and If-Then Contingencies , 2008 .

[4]  Pavla Chejnová Expressing politeness in the institutional e-mail communications of university students in the Czech Republic , 2014 .

[5]  Janet Cotterill How to use corpus linguistics in forensic linguistics , 2010 .

[6]  Patricia Bou-Franch,et al.  Openings and closings in Spanish email conversations , 2011 .

[7]  Fintan Culwin,et al.  Optimising and Automating the Choice of Search Strings when Investigating Possible Plagiarism , 2010 .

[8]  Monica Ancu From soundbite to textbite: Election 2008 comments on Twitter. , 2010 .

[9]  Shlomo Hershkop,et al.  Automated social hierarchy detection through email network analysis , 2007, WebKDD/SNA-KDD '07.

[10]  Efstathios Stamatatos,et al.  On the Robustness of Authorship Attribution Based on Character N -gram Features , 2013 .

[11]  Hsinchun Chen,et al.  A framework for authorship identification of online messages: Writing-style features and classification techniques , 2006 .

[12]  Krzysztof Kredens,et al.  Corpus Linguistics In Authorship Identification , 2012 .

[13]  Samuel Larner,et al.  A preliminary investigation into the use of fixed formulaic sequences as a marker of authorship , 2014 .

[14]  P. Juola Stylometry and Immigration: A Case Study , 2013 .

[15]  F. Mosteller,et al.  Inference and Disputed Authorship: The Federalist , 1966 .

[16]  Graeme Hirst,et al.  Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts , 2007, Lit. Linguistic Comput..

[17]  Mike Scott Wordsmith Tools version 3 , 1997 .

[18]  Paolo Rosso,et al.  Authorship Attribution Using Word Sequences , 2006, CIARP.

[19]  Jacques Savoy,et al.  Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages , 2012, J. Quant. Linguistics.

[20]  M. Hoey Lexical Priming: A New Theory of Words and Language , 2005 .

[21]  David L. Hoover,et al.  Testing Burrows's Delta , 2004, Lit. Linguistic Comput..

[22]  John R. Searle,et al.  Speech Acts: An Essay in the Philosophy of Language , 1970 .

[23]  Tim Grant,et al.  Quantifying evidence in forensic authorship analysis , 2007 .

[24]  Jack Grieve,et al.  Quantitative Authorship Attribution: An Evaluation of Techniques , 2007, Lit. Linguistic Comput..

[25]  Shlomo Argamon,et al.  Authorship attribution in the wild , 2010, Lang. Resour. Evaluation.

[26]  Mark S. Manasse,et al.  On the Efficient Determination of Most Near Neighbors: Horseshoes, Hand Grenades, Web Search and Other Situations When Close is Close Enough , 2012 .

[27]  Sandra Mollin,et al.  I entirely understand is a Blairism: The methodology of identifying idiolectal collocations , 2009 .

[28]  Kevin J. Gaston,et al.  Patterns of plant beta‐diversity along elevational and latitudinal gradients in mountain forests of China , 2012 .

[29]  M. Coulthard,et al.  On the use of corpora in the analysis of forensic texts , 2013 .

[30]  P. Jaccard THE DISTRIBUTION OF THE FLORA IN THE ALPINE ZONE , 1912 .

[31]  Csr Young,et al.  How to Do Things With Words , 2009 .

[32]  Din J. Wasem,et al.  Mining of Massive Datasets , 2014 .

[33]  Peter Tiersma,et al.  Author Identification in American Courts , 2004 .

[34]  Walter Daelemans,et al.  The effect of author set size and data size in authorship attribution , 2011, Lit. Linguistic Comput..

[35]  Shlomo Argamon,et al.  Authorship attribution with thousands of candidate authors , 2006, SIGIR.

[36]  David J. Marchette,et al.  Scan Statistics on Enron Graphs , 2005, Comput. Math. Organ. Theory.

[37]  Jessica Woodhams,et al.  Linking serial residential burglary: comparing the utility of modus operandi behaviours, geographical proximity, and temporal proximity , 2010 .

[38]  David L. Hoover,et al.  Frequent Collocations and Authorial Style , 2003, Lit. Linguistic Comput..

[39]  C. Goddard Words and Phrases: Corpus Studies of Lexical Semantics , 2006 .

[40]  A. Price,et al.  Measuring β-diversity using a taxonomic similarity index, and its relation to spatial scale , 2001 .

[41]  David Wright,et al.  Stylistic variation within genre conventions in the Enron email corpus: developing a text-sensitive methodology for authorship research , 2013 .

[42]  Cécile Paris,et al.  The nature of requests and commitments in email messages , 2008, AAAI 2008.

[43]  David Woolls,et al.  Who wrote this? The linguist as detective , 2009 .

[44]  Stefan Th. Gries,et al.  50-something years of work on collocations: What is or should be next … , 2013 .

[45]  Shlomo Argamon,et al.  The Rest of the Story: Finding Meaning in Stylistic Variation , 2010, The Structure of Style.

[46]  Lynne Flowerdew,et al.  The argument for using English specialised corpora to understand academic and professional language. , 2004 .

[47]  Christine Nadel,et al.  Case Study Research Design And Methods , 2016 .

[48]  Nuria Lorenzo-Dus,et al.  Natural versus elicited data in cross-cultural speech act realisation: The case of requests in Peninsular Spanish and British English , 2008 .

[49]  Michael Pace-Sigge Lexical Priming in Spoken English Usage , 2013 .

[50]  Carole E. Chaski,et al.  Empirical evaluations of language-based author identification techniques , 2001 .

[51]  H. Love Attributing Authorship: An Introduction , 2002 .

[52]  Kirk W. Duthler The Politeness of Requests Made Via Email and Voicemail: Support for the Hyperpersonal Model , 2006, J. Comput. Mediat. Commun..

[53]  Ingrid Pufahl Bax How to assign work in an office: A comparison of spoken and written directives in American English , 1986 .

[54]  Li Lan Email: a challenge to Standard English? , 2000, English Today.

[55]  Simon Günter,et al.  Short Text Authorship Attribution via Sequence Kernels, Markov Chains and Author Unmasking: An Investigation , 2006, EMNLP.

[56]  Joan Waldvogel,et al.  Greetings and Closings in Workplace Email , 2007, J. Comput. Mediat. Commun..

[57]  Lawrence M. Solan Intuition versus Algorithm: The Case of Forensic Authorship Attribution , 2013 .

[58]  David L. Hoover Frequent Word Sequences and Statistical Stylistics , 2002, Lit. Linguistic Comput..

[59]  Shlomo Argamon,et al.  A Systemic Functional Approach To Automated Authorship Analysis , 2013 .

[60]  John Olsson,et al.  Forensic linguistics , 1997, English Today.

[61]  Hans van Halteren,et al.  New Machine Learning Methods Demonstrate the Existence of a Human Stylome , 2005, J. Quant. Linguistics.

[62]  Antonio Rico-Sulayes,et al.  Statistical Authorship Attribution of Mexican Drug Trafficking Online Forum Posts , 2011 .

[63]  G. Mazzoleni,et al.  "Mediatization" of Politics: A Challenge for Democracy? , 1999 .

[64]  J. Sherblom Direction, Function, and Signature in Electronic Mail , 1988 .

[65]  J. R. Firth,et al.  A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[66]  Tim Grant,et al.  Bridging the gap between stylistic and cognitive approaches to authorship analysis using Systemic Functional Linguistics and multidimensional analysis , 2013 .

[67]  Efstathios Stamatatos,et al.  Author identification: Using text sampling to handle the class imbalance problem , 2008, Inf. Process. Manag..

[68]  Tim D. Grant TXT 4N6:method, consistency, and distinctiveness in the analysis of sms text messages , 2013 .

[69]  Michael Haugh,et al.  Getting stuff done: Comparing e-mail requests from students in higher education in Britain and Australia , 2012 .

[70]  Alison Wray Formulaic Language and the Lexicon , 2002 .

[71]  John Sinclair,et al.  Corpus, Concordance, Collocation , 1991 .

[72]  Nor Fariza Mohd Nor,et al.  Politeness In E-mails Of Arab Students In Malaysia , 2012 .

[73]  Efstathios Stamatatos,et al.  A survey of modern authorship attribution methods , 2009, J. Assoc. Inf. Sci. Technol..

[74]  David D. Clarke,et al.  Climate in the News: How Differences in Media Discourse Between the US and UK Reflect National Priorities , 2012 .

[75]  Alison Wray,et al.  Formulaic Language: Pushing the Boundaries , 2008 .

[76]  Lynda Lee Kaid,et al.  Techno politics in presidential campaigning : new voices, new technologies, and new voters , 2011 .

[77]  Naomi S. Baron Letters by Phone or Speech by Other Means: The Linguistics of Email. , 1998 .

[78]  Stefan Siersdorfer,et al.  Efficient jaccard-based diversity analysis of large document collections , 2012, CIKM.

[79]  Jonathan Gains Electronic Mail--A New Style of Communication or Just a New Medium? An Investigation into the Text Features of E-Mail. , 1999 .

[80]  Nadine Van den Eynden,et al.  Politeness and gender in Belgian organisational emails , 2012 .

[81]  Svenja Adolphs,et al.  Are corpus-derived recurrent clusters psycholinguistically valid? , 2004 .

[82]  M. Barlow Individual usage : a corpus-based study of idiolects , 2010 .

[83]  Craig Bennell,et al.  Between a ROC and a hard place: a method for linking serial burglaries by modus operandi , 2005 .

[84]  M. Coulthard On Admissible Linguistic Evidence , 2013 .

[85]  J. Knox Visual-verbal communication on online newspaper home pages , 2007 .

[86]  Patrick Juola,et al.  Authorship Attribution , 2008, Found. Trends Inf. Retr..

[87]  J. Luchjenbroers,et al.  Paedophiles and politeness in email communications: Community of practice needs that define face-threat , 2011 .

[88]  M. T. Turell The use of textual, grammatical and sociolinguistic evidence in forensic text comparison , 2011 .

[89]  Julio Gimenez,et al.  Business e-mail communication: some emerging tendencies in register , 2000 .

[90]  Tim D. Grant,et al.  Identifying reliable, valid markers of authorship: a response to Chaski , 2001 .

[91]  Dominique Labbé,et al.  Experiments on authorship attribution by intertextual distance in English , 2007, J. Quant. Linguistics.

[92]  Felix Naumann,et al.  An Introduction to Duplicate Detection , 2010 .

[93]  Shlomo Argamon,et al.  Authorship Attribution: What's Easy and What's Hard? , 2013 .

[94]  James R. Nattinger,et al.  Lexical Phrases and Language Teaching , 1992 .

[95]  Shlomo Argamon,et al.  Computational methods in authorship attribution , 2009 .

[96]  John Burrows,et al.  'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship , 2002, Lit. Linguistic Comput..

[97]  Antoine Guisan,et al.  The accuracy of plant assemblage prediction from species distribution models varies along environmental gradients , 2013 .

[98]  M. Coulthard Author Identification, Idiolect, and Linguistic Uniqueness. , 2004 .

[99]  John Burrows Andrew Marvell and the 'painter satires': a computational approach to their authorship , 2005 .