The Case for Being Average: A Mediocrity Approach to Style Masking and Author Obfuscation - (Best of the Labs Track at CLEF-2017)

Users posting online often expect to remain anonymous unless they have logged in, and this anonymity is often what allows them to discuss various topics freely. Preserving the anonymity of a text's writer can also be important in other contexts, e.g., in witness protection or anonymity programs. However, each person has his/her own writing style, which can be analyzed using stylometry; as a result, the true identity of the author of a piece of text can be revealed even if s/he has tried to hide it. Thus, it would be helpful to design automatic tools that help a person obfuscate his/her identity when writing text. In particular, here we propose an approach that changes the text so that it is pushed towards average values for some general stylometric characteristics, thus making these characteristics less discriminative. The approach consists of three main steps: first, we calculate the values of some popular stylometric metrics that can indicate authorship; then, we apply various transformations to the text so that these metrics are adjusted towards the average level, while preserving the semantics and the soundness of the text; and finally, we add random noise. This approach proved very effective, yielding the best performance on the Author Obfuscation task at the PAN-2016 competition.
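To make the first two steps more concrete, the following is a minimal sketch of how one might measure a few stylometric metrics and decide in which direction each should be pushed. It is an illustration only, not the implementation evaluated at PAN-2016: the three metrics, the `TARGET` averages, the `tolerance` parameter, and the function names `style_metrics` and `adjustment_plan` are all assumptions introduced here. The text transformations themselves and the random-noise step are not shown.

```python
import re

# Hypothetical "average" values; in practice these would be estimated
# from a reference corpus rather than hard-coded.
TARGET = {
    "avg_sentence_len": 20.0,   # words per sentence
    "avg_word_len": 4.5,        # characters per word
    "type_token_ratio": 0.5,    # distinct words / total words
}

def style_metrics(text):
    """Compute a few simple stylometric metrics for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }

def adjustment_plan(text, target=TARGET, tolerance=0.1):
    """For each metric, say whether it should move up, down, or stay,
    so that the text drifts towards the target averages."""
    plan = {}
    for name, value in style_metrics(text).items():
        goal = target[name]
        if value > goal * (1 + tolerance):
            plan[name] = "decrease"
        elif value < goal * (1 - tolerance):
            plan[name] = "increase"
        else:
            plan[name] = "keep"
    return plan

if __name__ == "__main__":
    print(adjustment_plan("I ran. I sat. I ate."))
```

A plan entry of "increase" for `avg_sentence_len`, for example, would suggest merging short sentences, while "decrease" would suggest splitting long ones; which concrete transformations to apply is left open in this sketch.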
