Filtering Image Spam with Near-Duplicate Detection

A new trend in email spam is the emergence of image spam. Although current anti-spam technologies are quite successful at filtering text-based spam, image spam is substantially more difficult to detect because it employs a variety of image creation and randomization algorithms. Spam image creation algorithms are designed to defeat well-known vision algorithms such as optical character recognition (OCR), whereas randomization techniques ensure that each image is unique. We observe that image spam is often sent in batches of visually similar images that differ only through the application of randomization algorithms. Based on this observation, we propose an image spam detection system that uses near-duplicate detection to identify spam images. We rely on traditional anti-spam methods to detect a subset of spam images and then use multiple image spam filters to detect all the spam images that “look” like the spam caught by traditional methods. We have implemented a prototype system that achieves a high detection rate with a false positive rate below 0.001%.
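The two-stage idea described above (seed a set of known spam images with traditional filters, then flag anything that looks like a seed) can be illustrated with a minimal sketch. The abstract does not specify which near-duplicate filters the prototype uses, so the example below substitutes a simple average hash compared by Hamming distance as a stand-in. The names `average_hash`, `hamming`, and `is_near_duplicate`, the 8x8 grid, and the distance threshold of 5 are all assumptions made for illustration, not details taken from the system.

```python
from typing import Iterable, Sequence

# A grayscale image as rows of pixel intensities in 0..255 (assumed input format).
Image = Sequence[Sequence[int]]


def average_hash(img: Image, grid: int = 8) -> int:
    """Return a grid*grid-bit signature: one bit per block, set when the
    block's mean intensity exceeds the mean over all blocks.
    Assumes the image is at least grid x grid pixels."""
    h, w = len(img), len(img[0])
    means = []
    for gy in range(grid):
        y0, y1 = gy * h // grid, (gy + 1) * h // grid
        for gx in range(grid):
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            block = [img[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(block) / len(block))
    avg = sum(means) / len(means)
    bits = 0
    for m in means:
        bits = (bits << 1) | (1 if m > avg else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(img: Image, known_spam_hashes: Iterable[int],
                      max_distance: int = 5) -> bool:
    """Flag img if its signature is within max_distance bits of any
    signature taken from spam already caught by traditional filters.
    The threshold is a hypothetical value, not one from the paper."""
    sig = average_hash(img)
    return any(hamming(sig, known) <= max_distance for known in known_spam_hashes)
```

In this sketch, small pixel-level perturbations of the kind randomization introduces tend to leave block means, and therefore the signature, largely unchanged, while an image with a different layout produces a distant signature. A single coarse hash like this would not withstand every randomization technique, which is consistent with the abstract's use of multiple filters.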
