Identifying Sexual Predators by SVM Classification with Lexical and Behavioral Features

We identify sexual predators in a large corpus of web chats using SVM classification with a bag-of-words model over unigrams and bigrams. This simple lexical approach is effective, achieving an F1 score of 0.77 against a baseline of 0.003. Adding features that encode the language used by an author’s conversation partners, together with a few small heuristics, raises the F1 score to 0.83. To identify the most “predatory” messages, we score each message by the average weight of the n-grams it contains, with the weights taken from a linear SVM model. We further boost performance with a manually constructed “blacklist”.
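
The message-scoring rule can be illustrated with a minimal sketch in plain Python. It assumes that the n-gram weights have already been read out of a trained linear SVM and stored in a dictionary; the names `extract_ngrams`, `message_score`, `rank_messages`, and `TOY_WEIGHTS`, the whitespace tokenization, and the choice to give unseen n-grams a weight of zero are all illustrative assumptions rather than details stated here, and the toy weights are invented for the example.

```python
from typing import Dict, List, Tuple

NGram = Tuple[str, ...]


def extract_ngrams(tokens: List[str]) -> List[NGram]:
    """Return all unigrams followed by all bigrams of a token list."""
    unigrams = [(t,) for t in tokens]
    bigrams = [(a, b) for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams


def message_score(message: str, weights: Dict[NGram, float]) -> float:
    """Average the linear-model weights of the n-grams in one message.

    Assumptions: lowercase whitespace tokenization, and a weight of 0.0
    for any n-gram the model has never seen.
    """
    tokens = message.lower().split()
    ngrams = extract_ngrams(tokens)
    if not ngrams:
        return 0.0  # an empty message carries no evidence either way
    return sum(weights.get(g, 0.0) for g in ngrams) / len(ngrams)


def rank_messages(messages: List[str], weights: Dict[NGram, float]) -> List[str]:
    """Sort messages from highest to lowest score."""
    return sorted(messages, key=lambda m: message_score(m, weights), reverse=True)


# Hypothetical weights, invented for illustration only. In practice they
# would come from the coefficient vector of a trained linear SVM.
TOY_WEIGHTS: Dict[NGram, float] = {
    ("home",): 0.4,
    ("alone",): 0.8,
    ("home", "alone"): 1.2,
    ("homework",): -0.6,
}

if __name__ == "__main__":
    for msg in rank_messages(["doing homework", "home alone"], TOY_WEIGHTS):
        print(f"{message_score(msg, TOY_WEIGHTS):+.3f}  {msg}")
```

Averaging rather than summing keeps long messages from outranking short ones merely because they contain more n-grams; whether unseen n-grams should count toward the denominator is a design choice that this sketch settles arbitrarily.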