On Identifying Authors with Style

Stylometry is the quantitative (often statistical) analysis of an author's style, treated as a set of (usually morphosyntactic) features expressed across several documents by that author. This paper focuses on a task to which stylometry is often applied: authorship attribution, the problem of identifying or confirming the author of a text based on a known body of work. We analyze a feature set previously introduced in the field, using an existing tool and corpus. By decomposing the set, we identify the features that appear to contribute most to accurate performance. By recomposing the set under different objectives (first for English-only document sets, then for possible multi-language use), we identify smaller combinations of features that work well together. We then outline our continuing work based on these results.
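As a rough illustration of what decomposing a feature set can look like, the sketch below runs a leave-one-feature-out ablation around a nearest-centroid attributor. It is not the feature set, tool, or corpus analyzed in this paper: the function-word list `FEATURES`, the toy documents, and every function name are hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter

# Hypothetical feature set: relative frequencies of a few function words.
# This is an assumption for illustration, not the paper's feature set.
FEATURES = ["the", "of", "and", "to", "i"]

def extract(text, features):
    """Return the relative frequency of each feature word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in features]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def attribute(text, training, features):
    """Pick the author whose centroid is closest (squared Euclidean)."""
    target = extract(text, features)
    best_author, best_dist = None, float("inf")
    for author, docs in training.items():
        c = centroid([extract(d, features) for d in docs])
        dist = sum((a - b) ** 2 for a, b in zip(target, c))
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author

def accuracy(training, test, features):
    """Fraction of (text, author) test pairs attributed correctly."""
    correct = sum(attribute(t, training, features) == a for t, a in test)
    return correct / len(test)

def ablation(training, test, features):
    """Return baseline accuracy and the accuracy drop per removed feature."""
    base = accuracy(training, test, features)
    drops = {}
    for f in features:
        reduced = [g for g in features if g != f]
        drops[f] = base - accuracy(training, test, reduced)
    return base, drops

# Toy data, invented for this example only.
training = {
    "A": ["the cat and the dog and the bird", "the sun and the moon and the sea"],
    "B": ["i want to go to town", "i like to sing to you"],
}
test = [("the hat and the coat", "A"), ("i have to run to work", "B")]

base, drops = ablation(training, test, FEATURES)
print(base, drops)
```

A feature whose removal produces a large drop would be a candidate for a smaller set; on realistic data the drops would need to be estimated with cross-validation rather than a single split.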
