Learning Fast Classifiers for Image Spam

Recently, spammers have proliferated “image spam”, emails which contain the text of the spam message in a human readable image instead of the message body, making detection by conventional content filters difficult. New techniques are needed to filter these messages. Our goal is to automatically classify an image directly as being spam or ham. We present features that focus on simple properties of the image, making classification as fast as possible. Our evaluation shows that they accurately classify spam images in excess of 90% and up to 99% on real world data. Furthermore, we introduce a new feature selection algorithm that selects features for classification based on their speed as well as predictive power. This technique produces an accurate system that runs in a tiny fraction of the time. Finally, we introduce Just in Time (JIT) feature extraction, which creates features at classification time as needed by the classifier. We demonstrate JIT extraction using a JIT decision tree that further increases system speed. This paper makes image spam classification practical by providing both high accuracy features and a method to learn fast classifiers.

[1]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[2]  Susan T. Dumais,et al.  A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[3]  Nevenka Dimitrova,et al.  Text detection for video analysis , 1999, Proceedings IEEE Workshop on Content-Based Access of Image and Video Libraries (CBAIVL'99).

[4]  Paul A. Viola,et al.  Robust Real-time Object Detection , 2001 .

[5]  Michael I. Jordan,et al.  On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[6]  Chein-I Chang,et al.  Automated system for text detection in individual video images , 2003, J. Electronic Imaging.

[7]  Ralph Ewerth,et al.  A robust algorithm for text detection in images , 2003, 3rd International Symposium on Image and Signal Processing and Analysis, 2003. ISPA 2003. Proceedings of the.

[8]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[9]  Yoram Singer,et al.  The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees , 2004, NIPS.

[10]  Santa Barbara,et al.  Embedded-Text Detection and Its Application to Anti-Spam Filtering , 2005 .

[11]  Kwang-Ting Cheng,et al.  Using visual features for anti-spam filtering , 2005, IEEE International Conference on Image Processing 2005.

[12]  David Heckerman,et al.  Stopping spam. What can be done to stanch the flood of junk e-mail messages? , 2005, Scientific American.

[13]  Ching-Tung Wu Embedded-Text Detection and Its Application to Anti-Spam Filtering , 2005 .

[14]  James A. Herson,et al.  Image analysis for efficient categorization of image-based spam e-mail , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[15]  Fabio Roli,et al.  Spam Filtering Based On The Analysis Of Text Information Embedded Into Images , 2006, J. Mach. Learn. Res..