Spam Filtering Based On The Analysis Of Text Information Embedded Into Images

In recent years, anti-spam filters have become necessary tools for Internet service providers to cope with the continuously growing spam phenomenon. Current server-side anti-spam filters consist of several modules, each aimed at detecting different features of spam e-mails. In particular, researchers have investigated text categorisation techniques for designing modules that analyse the semantic content of e-mails, because such techniques potentially generalise better than the manually derived classification rules used in current server-side filters. Very recently, however, spammers introduced a new trick: embedding the spam message into attached images, which can render ineffective all current techniques based on the analysis of digital text in the subject and body fields of e-mails. In this paper we propose an approach to anti-spam filtering that exploits the text information embedded into images sent as attachments. Our approach applies state-of-the-art text categorisation techniques to text extracted by OCR tools from images attached to e-mails. The effectiveness of the proposed approach is experimentally evaluated on two large corpora of spam e-mails.
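The pipeline outlined in the abstract (run OCR on attached images, then feed the extracted text to a text categoriser) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not say which classifier or OCR tool is used, so a small multinomial naive Bayes classifier with Laplace smoothing stands in for the "state-of-the-art text categorisation techniques", and `fake_ocr` is a hypothetical placeholder for a real OCR engine. Concatenating the body text with the OCR output is likewise only one possible way to combine the two sources.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Sequence

LABELS = ("spam", "ham")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with Laplace smoothing (illustrative stand-in)."""

    def __init__(self) -> None:
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def fit(self, texts: Sequence[str], labels: Sequence[str]) -> "NaiveBayesTextClassifier":
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, text: str, label: str) -> float:
        # Log prior of the class plus the log likelihood of each known token.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_words = sum(self.word_counts[label].values())
        for token in tokenize(text):
            if token not in self.vocab:
                continue  # tokens never seen in training carry no evidence
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def predict(self, text: str) -> str:
        return max(LABELS, key=lambda label: self.log_score(text, label))


def classify_email(
    body_text: str,
    images: Sequence[bytes],
    ocr: Callable[[bytes], str],
    classifier: NaiveBayesTextClassifier,
) -> str:
    """Classify an e-mail using body text plus text extracted from its images."""
    extracted = " ".join(ocr(image) for image in images)
    return classifier.predict(body_text + " " + extracted)


def fake_ocr(image: bytes) -> str:
    """Hypothetical OCR stand-in: the 'image' bytes simply hold the text."""
    return image.decode("ascii")


# Toy training data, invented for this sketch only.
TRAIN_TEXTS = [
    "buy cheap pills now",
    "cheap watches buy now online",
    "meeting agenda for monday",
    "please review the attached report",
]
TRAIN_LABELS = ["spam", "spam", "ham", "ham"]

classifier = NaiveBayesTextClassifier().fit(TRAIN_TEXTS, TRAIN_LABELS)

# An e-mail with an empty body whose message is carried entirely by the image.
print(classify_email("", [b"cheap pills buy now"], fake_ocr, classifier))
```

In a real deployment `fake_ocr` would be replaced by an actual OCR engine, whose output is typically noisy, and the classifier would be trained on a realistic corpus rather than four toy sentences.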
