Blocking spam by separating end-user machines from legitimate mail server machines

Spamming botnets present a critical challenge in controlling spam because of the sheer volume and wide distribution of botnet members. In this paper, we advocate that recipient mail servers filter messages delivered directly from remote end-user (EU) machines, given that the majority of spamming bots are EU machines. We develop a support vector machine (SVM)-based classifier to separate EU machines from legitimate mail server (LMS) machines, using a set of machine features that cannot be easily manipulated by spammers. We investigate the efficacy and performance of the SVM-based classifier using a number of real-world data sets. Our performance studies show that the SVM-based classifier is a feasible and effective approach to distinguishing EU machines from LMS machines. For example, when trained and tested on an aggregated data set containing both EU machines and LMS machines, the SVM-based classifier achieves, on average, a 99.25% detection accuracy, with a false positive rate of 0.35% and a false negative rate of 1.27%, significantly outperforming eight Domain Name System (DNS)-based blacklists widely used today. Copyright © 2012 John Wiley & Sons, Ltd.
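
For readers unfamiliar with the three metrics quoted above, the following minimal sketch shows how detection accuracy, false positive rate, and false negative rate are conventionally derived from a binary confusion matrix. The function name `classification_rates`, the choice of EU machines as the positive class, and the counts in the usage lines are all assumptions made for illustration; they are not taken from the paper or its data sets.

```python
def classification_rates(tp, fp, tn, fn):
    """Return (accuracy, false_positive_rate, false_negative_rate).

    Assumption for this sketch: the positive class is "EU machine",
    so a false positive is an LMS machine wrongly labeled as EU.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total        # fraction of all machines labeled correctly
    false_positive_rate = fp / (fp + tn)  # share of true negatives that were misclassified
    false_negative_rate = fn / (fn + tp)  # share of true positives that were missed
    return accuracy, false_positive_rate, false_negative_rate


# Hypothetical counts, chosen only to exercise the formulas.
acc, fpr, fnr = classification_rates(tp=90, fp=1, tn=99, fn=10)
print(f"accuracy={acc:.3f} fpr={fpr:.3f} fnr={fnr:.3f}")
```

Because the two error rates use different denominators, a classifier can report a high overall accuracy while still having a noticeably higher rate on one class, which is why the abstract reports all three figures.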
