Paper information - obfuscation using WordNet and language models Notebook for PAN at CLEF 2016

Because almost all successful author identification approaches rely on word frequencies, the most obvious way to obfuscate a text is to distort those frequencies. In this paper, we choose a subset of an author's most frequent words and replace each one with one of its synonyms. To select the best synonym, we consider two measures: the similarity between the original word and the synonym, and the difference between the scores (probabilities) that a language model assigns to the original and the distorted sentences. With the similarity measure, we aim to select words that are semantically close to the original word; with the language model, we favor word usages that are common.

Taher Rahgooy | Muharram Mansoorizadeh | Mohammad Aminiyan | Mahdy Eskandari
