obfuscation using WordNet and language models: Notebook for PAN at CLEF 2016

Since almost all successful author identification approaches rely on word frequencies, the most obvious way to obfuscate a text is to distort those frequencies. In this paper, we choose a subset of an author's most frequent words and replace each one with one of its synonyms. To select the best synonym, we consider two measures: the similarity between the original word and the synonym, and the difference between the scores (probabilities) that a language model assigns to the original and the distorted sentences. The similarity measure favors synonyms that are semantically close to the original word, while the language model favors word usages that are common.
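
The selection procedure described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work, and everything concrete in it is an assumption made for illustration: the `SYNONYMS` table is a hypothetical stand-in for WordNet with made-up similarity values, the language model is a toy add-one-smoothed bigram model trained on four sentences, and the weighted sum controlled by `alpha` is only one possible way of combining the two measures.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: word -> {synonym: similarity in [0, 1]}.
# The similarity values are invented for this example.
SYNONYMS = {
    "big": {"large": 0.9, "great": 0.6},
    "quick": {"fast": 0.9, "speedy": 0.7},
}

# Toy corpus for the language model (illustrative only).
CORPUS = [
    "the large dog ran fast",
    "a large house stood there",
    "the fast car won",
    "the great man spoke",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(tokens, lm):
    """Log-probability of a sentence under an add-one-smoothed bigram model."""
    unigrams, bigrams = lm
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def best_synonym(tokens, i, lm, alpha=0.5):
    """Pick the synonym of tokens[i] with the best combined score.

    The score is alpha * similarity + (1 - alpha) * (change in LM log-prob);
    this particular combination is an assumption of the sketch.
    """
    base = log_prob(tokens, lm)
    best, best_score = tokens[i], -math.inf
    for synonym, similarity in SYNONYMS.get(tokens[i], {}).items():
        candidate = tokens[:i] + [synonym] + tokens[i + 1:]
        lm_delta = log_prob(candidate, lm) - base
        score = alpha * similarity + (1 - alpha) * lm_delta
        if score > best_score:
            best, best_score = synonym, score
    return best


def obfuscate(text, author_texts, lm, top_k=2, alpha=0.5):
    """Replace the author's top_k most frequent replaceable words in text."""
    freq = Counter(w for t in author_texts for w in t.split())
    targets = [w for w, _ in freq.most_common() if w in SYNONYMS][:top_k]
    tokens = text.split()
    for i in range(len(tokens)):
        if tokens[i] in targets:
            tokens[i] = best_synonym(tokens, i, lm, alpha)
    return " ".join(tokens)


lm = train_bigram(CORPUS)
author_texts = ["the big cat", "a big quick fox", "quick and big"]
print(obfuscate("the big dog ran quick", author_texts, lm))
```

In this toy setting, "big" and "quick" are the author's most frequent replaceable words, so only those two positions are candidates for substitution. In a realistic setup the synonym table would come from WordNet and the bigram model would be replaced by a language model trained on a large corpus.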