An Approach to Image Spam Filtering Based on Base64 Encoding and N-Gram Feature Extraction

Image spam is a variant of text spam devised to evade traditional text-based spam classification and filtering. Various approaches to image spam filtering have been proposed, each with its own advantages and drawbacks in terms of time cost and efficiency. In this paper, we propose a new approach based on Base64 encoding of image files and the $n$-gram technique for feature extraction. We transform each image into its Base64 representation and extract features from the resulting character string with the $n$-gram technique. With these features we train an SVM (support vector machine), which proves effective and efficient in distinguishing spam images from legitimate ones. Using a personal image corpus that is shared online as input, experimental results show that our approach, compared with some existing feature extraction methods, achieves very high image spam classification performance on basic measures such as accuracy, precision, and recall. Moreover, our approach demonstrates its practicality by requiring less running time for image spam classification than the other methods.
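The feature extraction step described above can be illustrated with a minimal sketch. It assumes character-level $n$-grams over the Base64 string and relative-frequency features; the abstract does not state the value of $n$, the weighting scheme, or the vocabulary construction, so those choices, along with the function names `base64_ngram_counts` and `to_vector` and the toy byte strings, are hypothetical and only for illustration.

```python
import base64
from collections import Counter


def base64_ngram_counts(image_bytes: bytes, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the Base64 text of a file."""
    text = base64.b64encode(image_bytes).decode("ascii")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def to_vector(counts: Counter, vocabulary: list[str]) -> list[float]:
    """Project n-gram counts onto a fixed vocabulary as relative frequencies."""
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return [counts[gram] / total for gram in vocabulary]


# Hypothetical stand-ins for the raw contents of two image files.
samples = [
    b"\x89PNG\r\n\x1a\n" + bytes(range(32)),
    b"\xff\xd8\xff\xe0" + bytes(range(32, 64)),
]

# Assumption: the vocabulary is the set of all 2-grams seen in the samples.
all_counts = [base64_ngram_counts(s, n=2) for s in samples]
vocabulary = sorted(set().union(*all_counts))
vectors = [to_vector(c, vocabulary) for c in all_counts]
```

Each row of `vectors` is a fixed-length feature vector that could then be passed, together with spam/legitimate labels, to any SVM implementation for training. The classifier itself is left out here to keep the sketch free of third-party dependencies.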
