Identification and on-line incremental clustering of spam campaigns

Spam filters handle the ever-growing volume of spam email adequately, but spam can be addressed more effectively by understanding how spammers act. Grouping spam emails into spam campaigns provides valuable information on several aspects: how spammers obfuscate their messages, how seemingly different spam campaigns correlate, and many descriptive statistics. In this thesis, we focus on identifying spam campaigns over a 7.5-month period by clustering, based on their content, the web pages referenced by the URLs inside the spam emails. We then apply Latent Dirichlet Allocation to assign a topic to every cluster, and finally we present a mechanism that incrementally clusters incoming spam emails into spam campaigns in an automatic, on-line environment. We argue that our method for spam campaign identification is quick and efficient, and that it represents the identified spam campaigns in a compact manner. It can also assist towards a better understanding of the domain and its applications.
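To make the idea of on-line incremental clustering concrete, the following is a minimal sketch and not the mechanism developed in the thesis. It assumes that page content is compared through word shingles and Jaccard similarity, and that a page joins the most similar existing cluster when the similarity reaches a fixed threshold; the abstract does not state the similarity measure or the assignment rule, so both are assumptions. The names `shingles`, `jaccard`, `IncrementalClusterer` and the threshold value are hypothetical.

```python
from dataclasses import dataclass, field


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles of a page's text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        # Pages shorter than k words get a single shingle, or none if empty.
        return frozenset([tuple(words)]) if words else frozenset()
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two shingle sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


@dataclass
class IncrementalClusterer:
    """Hypothetical on-line clusterer: each cluster keeps its first page as representative."""
    threshold: float = 0.5  # assumed similarity cut-off, not taken from the thesis
    representatives: list = field(default_factory=list)
    members: list = field(default_factory=list)

    def add(self, page_id: str, text: str) -> int:
        """Assign a newly arrived page to a cluster and return the cluster index."""
        signature = shingles(text)
        best_index, best_similarity = -1, 0.0
        for index, representative in enumerate(self.representatives):
            similarity = jaccard(signature, representative)
            if similarity > best_similarity:
                best_index, best_similarity = index, similarity
        if best_index >= 0 and best_similarity >= self.threshold:
            # Close enough to an existing campaign: join it.
            self.members[best_index].append(page_id)
            return best_index
        # Otherwise the page starts a new campaign.
        self.representatives.append(signature)
        self.members.append([page_id])
        return len(self.representatives) - 1


clusterer = IncrementalClusterer(threshold=0.5)
first = clusterer.add("p1", "cheap replica watches buy now best price online")
second = clusterer.add("p2", "cheap replica watches buy now best price today")
third = clusterer.add("p3", "lose weight fast with this miracle pill order here")
```

In this sketch each page is processed once, on arrival, and is compared only against one representative per cluster, which keeps the per-page cost proportional to the number of clusters rather than the number of pages seen so far. A topic model such as Latent Dirichlet Allocation could then be fitted on the text of each cluster's members to label it, which is outside the scope of this example.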
