Real-Time Twitter Content Polluter Detection Based on Direct Features

The large number of content polluters on social networks makes it difficult for users to find valuable content. Some research has addressed spam and phishing detection on social networks, but these account for only a small part of all content polluters. What bothers users most is the large volume of repeated, low-quality advertisements. Hence it is necessary to filter these content polluters to improve the user experience. Moreover, most phishing/spam detection work is done offline, and some of the features used take too long to extract, which makes real-time detection impossible. We perform a study on an extensive Twitter dataset and present a definition of content polluters. We further propose some novel features and, together with other features commonly used in phishing/spam detection, classify them into two categories: direct features and indirect features. A simple random forest classifier, applied to our proposed direct features alone, performs real-time content polluter detection and achieves reasonably high accuracy and F1 values.
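
As a rough illustration of what features of the direct kind might look like, the sketch below computes a few quantities from the tweet text alone, with no network lookups or account-history queries. The specific features (length, URL count, hashtag count, mention count, uppercase ratio) and the function name `direct_features` are assumptions chosen for illustration; they are not the feature set proposed in this paper.

```python
import re

# Hypothetical patterns for illustration; not the paper's feature definitions.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def direct_features(text: str) -> dict:
    """Compute cheap features from the tweet text only (no external lookups)."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "length": len(text),
        "num_urls": len(URL_RE.findall(text)),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_mentions": len(MENTION_RE.findall(text)),
        # Share of alphabetic characters that are uppercase; 0.0 if no letters.
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }
```

Because each value is computed in a single pass over the text, a vector of this kind could in principle be extracted as tweets arrive and passed to a classifier such as a random forest.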