Incremental Adaptive Spam Mail Filtering Using Naïve Bayesian Classification

Most content-based spam filters are rule-based or trained off-line, so they handle new spam tactics poorly and are prone to high misclassification rates. This paper proposes an incremental adaptive spam mail filter using Naïve Bayesian classification, which offers good performance, simplicity and adaptability. We model an incremental scheme that receives a stream of emails and applies a sliding window, so that only the last w emails are used for training before new incoming messages are tested. Subsequently, the new features of the tested messages are added to the existing features so that the model adapts to future incoming emails. The proposed model is tested on two corpora: Trec05p-1 [11] and Trec06p [12]. The parameters are the window size and the number of features, and the evaluation metrics are the processing time per message and the ham and spam misclassification rates. The experimental results show that the number of features has little impact, whereas the window size has significant effects on both the misclassification rates and the processing time. In addition, the overall accuracy is even better than that obtained from batch off-line training, and the processing time is reduced significantly.
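The sliding-window scheme described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `SlidingWindowNB`, the helper `process_stream`, the multinomial token model, the Laplace smoothing, and the assumption that the true label becomes available right after each message is tested are all hypothetical choices made for illustration. Feature selection (the "number of features" parameter) is omitted.

```python
import math
from collections import Counter, deque


class SlidingWindowNB:
    """Hypothetical sketch: naive Bayes trained on only the last w messages."""

    def __init__(self, w):
        self.w = w
        self.window = deque()  # (tokens, label) pairs, oldest first
        self.docs = Counter()  # number of messages per class in the window
        self.tokens = {"ham": Counter(), "spam": Counter()}  # token counts per class

    def _update(self, tokens, label, sign):
        # Add (sign=+1) or remove (sign=-1) one message's counts.
        self.docs[label] += sign
        for t in tokens:
            self.tokens[label][t] += sign
            if self.tokens[label][t] == 0:
                del self.tokens[label][t]

    def learn(self, tokens, label):
        # New tokens enter the vocabulary here; the oldest message is
        # forgotten once the window holds more than w messages.
        self.window.append((tokens, label))
        self._update(tokens, label, +1)
        if len(self.window) > self.w:
            old_tokens, old_label = self.window.popleft()
            self._update(old_tokens, old_label, -1)

    def predict(self, tokens):
        vocab = len(set(self.tokens["ham"]) | set(self.tokens["spam"])) or 1
        total_docs = sum(self.docs.values())
        best, best_score = "ham", -math.inf
        for label in ("ham", "spam"):
            # Laplace-smoothed prior and likelihoods (assumed smoothing).
            score = math.log((self.docs[label] + 1) / (total_docs + 2))
            n = sum(self.tokens[label].values())
            for t in tokens:
                score += math.log((self.tokens[label][t] + 1) / (n + vocab))
            if score > best_score:  # ties stay with "ham"
                best, best_score = label, score
        return best


def process_stream(model, stream):
    """Test each message first, then train on it (assumes labels arrive afterwards)."""
    predictions = []
    for tokens, label in stream:
        predictions.append(model.predict(tokens))
        model.learn(tokens, label)
    return predictions
```

In this sketch a smaller `w` means fewer counts to maintain and faster forgetting of old spam patterns, which is the trade-off the window-size parameter controls in the experiments described above.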

[1] D. Sculley, et al. Relaxed online SVMs for spam filtering, 2007, SIGIR.

[2] Susan T. Dumais, et al. A Bayesian Approach to Filtering Junk E-Mail, 1998, AAAI 1998.

[3] Jiawei Han, et al. Data Mining: Concepts and Techniques, Second Edition, 2006, The Morgan Kaufmann series in data management systems.

[4] Inderjit S. Dhillon, et al. Efficient Clustering of Very Large Document Collections, 2001.

[5] Jian Pei, et al. Data Mining: Concepts and Techniques, 3rd edition, 2006.

[6] Gordon V. Cormack, et al. Online supervised spam filter evaluation, 2007, TOIS.

[7] Georgios Paliouras, et al. An evaluation of Naive Bayesian anti-spam filtering, 2000, ArXiv.

[8] Gordon V. Cormack, et al. Spam Corpus Creation for TREC, 2005, CEAS.

[9] Jiawei Han, et al. Data Mining: Concepts and Techniques, 2000.

[10] Gordon V. Cormack, et al. TREC 2006 Spam Track Overview, 2006, TREC.

[11] Harris Drucker, et al. Support vector machines for spam categorization, 1999, IEEE Trans. Neural Networks.

[12] Vangelis Metsis, et al. Spam Filtering with Naive Bayes - Which Naive Bayes?, 2006, CEAS.

[13] George Forman. Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification, 2002, PKDD.

[14] Sudsanguan Ngamsuriyaroj, et al. Incremental Naïve Bayesian Spam Mail Filtering and Variant Incremental Training, 2009, 2009 Eighth IEEE/ACIS International Conference on Computer and Information Science.

[15] Wei Cao, et al. York University at TREC 2005: SPAM Track, 2005, TREC.