Incremental information gain analysis of input attribute impact on RBF-kernel SVM spam detection

The massive increase in spam poses a serious threat to email and SMS, which have become important means of communication. Spam not only annoys users but also poses a security threat. Machine learning techniques have been widely used for spam detection. Email spam can be detected from the sender's behaviour, the content of the email, its subject and source address, etc., while SMS spam detection is usually based on the tokens or features of the message because the content is short. However, a comprehensive analysis of email/SMS content may provide cues that help users become aware of email/SMS spam, since we cannot depend entirely on automatic tools to identify all spam. In this paper, we propose an analysis approach based on information entropy and incremental learning to see how various features affect the performance of an RBF-based SVM spam detector, so as to increase our awareness of spam by sensing its features. The experiments were carried out on the Spambase and SMSSpamCollection datasets in the UCI Machine Learning Repository. The results show that some features have a significant impact on spam detection, of which users should be aware, and that there exists a feature space that achieves Pareto efficiency in True Positive Rate and True Negative Rate.
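As a rough illustration of the entropy-based part of this approach, the sketch below ranks discrete features by information gain, i.e. the reduction in label entropy once a feature's value is known. It is not the paper's implementation: the function names (`entropy`, `information_gain`, `rank_features`), the binary feature names, and the toy data are all assumptions made for illustration, and the incremental training of the RBF-kernel SVM on the ranked features is left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """H(labels) - H(labels | feature) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Conditional entropy: entropy of each group weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted by decreasing information gain."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: three binary features per message.
names = ["has_free", "has_url", "has_greeting"]
rows = [
    (1, 1, 0),
    (1, 0, 1),
    (1, 1, 1),
    (0, 0, 1),
    (0, 1, 0),
    (0, 0, 0),
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

ranking = rank_features(rows, labels, names)
for name, gain in ranking:
    print(f"{name}: {gain:.4f}")
```

In an incremental analysis of the kind described, features could then be added to the classifier one at a time in this order, with True Positive Rate and True Negative Rate recorded after each addition. Continuous attributes such as those in Spambase would need to be discretized first, which this sketch does not do.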
