Clustering spam domains and hosts: anti-spam forensics with data mining

Spam-related cyber crimes, including phishing, malware, and online fraud, are a serious threat to society. Spam filtering has been the major weapon against spam for many years, but it has failed to reduce the number of spam emails. To hinder spammers’ ability to send spam, their supporting infrastructure needs to be disrupted. Terminating spam hosts will greatly reduce spammers’ profit and thwart their ability to commit spam-related cyber crimes. This research proposes an algorithm for clustering spam domains based on their hosting IP addresses and related email subjects. The algorithm can also detect significant hosts over a period of time. Experimental results show that when domain names are investigated, many seemingly unrelated spam emails are actually related. By using wildcard DNS records and constantly replacing old domains with new ones, spammers can effectively defeat URL-based or domain-based blacklisting. Spammers also refresh hosting IP addresses occasionally, but less frequently than domains. The identified domains and their hosting IP addresses can be used by cyber-crime investigators as leads to trace the identities of spammers and shut down the related spamming infrastructure. This paper demonstrates how data mining can help to detect spam domains and their hosts for anti-spam forensic purposes.

Keywords: spam, forensics, clustering, data mining
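As a rough illustration of the clustering idea described in the abstract, the sketch below groups domains that share at least one hosting IP address, using a union-find structure to compute connected components. It is not the paper's algorithm: it ignores email subjects and the time dimension, and the function name `cluster_domains_by_ip`, the `(domain, ip)` record format, and the sample data are all assumptions made for this example.

```python
from collections import defaultdict


def cluster_domains_by_ip(records):
    """Group domains that are linked, directly or transitively, by a shared IP.

    `records` is an iterable of (domain, ip) pairs (a hypothetical input
    format). Returns a sorted list of clusters, each a sorted list of domains.
    """
    parent = {}

    def find(node):
        # Locate the representative of a node, with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Domains and IPs are nodes of one bipartite graph; tag them so a
    # domain string can never collide with an IP string.
    for domain, ip in records:
        union(("domain", domain), ("ip", ip))

    clusters = defaultdict(set)
    for node in list(parent):
        kind, value = node
        if kind == "domain":
            clusters[find(node)].add(value)

    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


# Illustrative data only (reserved documentation names and addresses).
sample = [
    ("a.example", "192.0.2.1"),
    ("b.example", "192.0.2.1"),
    ("b.example", "192.0.2.2"),
    ("c.example", "192.0.2.2"),
    ("d.example", "198.51.100.7"),
]
result = cluster_domains_by_ip(sample)
print(result)
```

Here `a.example` and `c.example` never share an address directly, yet they fall into the same cluster through `b.example`, which mirrors the observation that seemingly unrelated spam can turn out to be related once hosting infrastructure is examined.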
