Detecting image spam using visual features and near duplicate detection

Email spam is a much-studied topic. Although current spam detection software has been gaining a competitive edge against text-based email spam, advances in spam generation have posed a new challenge: image-based spam. Image-based spam is email whose spam message is carried in an embedded image, that is, as binary image data rather than as text. In this paper, we study the characteristics of image spam and propose two solutions for detecting it, comparing them with existing techniques. The first solution uses visual features for classification and offers an accuracy of about 98%, an improvement of at least 6% over existing solutions. SVMs (Support Vector Machines) are used to train classifiers on carefully chosen color, texture and shape features. The second solution offers a novel approach to near-duplicate detection in images. It clusters image GMMs (Gaussian Mixture Models) based on the Agglomerative Information Bottleneck (AIB) principle, using Jensen-Shannon (JS) divergence as the distance measure.
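To make the second solution more concrete, the following is a minimal sketch of agglomerative clustering driven by a weighted Jensen-Shannon divergence. It is not the paper's implementation: as a simplifying assumption, each image is represented by a small discrete histogram (for example, a normalized color histogram) rather than a GMM, and the function names `kl_divergence`, `js_divergence` and `agglomerate` are hypothetical. The merge cost is the standard AIB form, the combined cluster weight times the weighted JS divergence between the two cluster distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions of equal length."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence; w_p and w_q should sum to 1."""
    m = [w_p * pi + w_q * qi for pi, qi in zip(p, q)]
    return w_p * kl_divergence(p, m) + w_q * kl_divergence(q, m)


def agglomerate(histograms, n_clusters):
    """Greedily merge the cheapest pair of clusters until n_clusters remain.

    Assumption: each image is a normalized histogram, standing in for the
    GMMs used in the paper. Every image starts as its own cluster with
    prior weight 1/n.
    """
    n = len(histograms)
    clusters = [
        {"members": [i], "weight": 1.0 / n, "dist": list(h)}
        for i, h in enumerate(histograms)
    ]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca, cb = clusters[a], clusters[b]
                w = ca["weight"] + cb["weight"]
                # AIB merge cost: combined weight times weighted JS divergence.
                cost = w * js_divergence(
                    ca["dist"], cb["dist"], ca["weight"] / w, cb["weight"] / w
                )
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        ca, cb = clusters[a], clusters[b]
        w = ca["weight"] + cb["weight"]
        merged = {
            "members": ca["members"] + cb["members"],
            "weight": w,
            # The merged distribution is the weight-averaged mixture.
            "dist": [
                (ca["weight"] * x + cb["weight"] * y) / w
                for x, y in zip(ca["dist"], cb["dist"])
            ],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c["members"]) for c in clusters]


# Made-up example: two pairs of nearly identical histograms.
images = [
    [0.70, 0.20, 0.10],
    [0.68, 0.22, 0.10],
    [0.10, 0.20, 0.70],
    [0.12, 0.18, 0.70],
]
groups = agglomerate(images, n_clusters=2)
```

In this toy run the near-duplicate histograms end up in the same group. A fuller version would replace the histogram mixture with a GMM-level divergence and stop merging once the cost exceeds a threshold instead of at a fixed cluster count; both choices are left open here.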
