Paper information - Online supervised learning from multi-field documents for email spam filtering

Online supervised learning from multi-field documents for email spam filtering

Email spam filtering can be framed as an online supervised learning task for binary text classification (TC). Previous statistical TC algorithms typically treat an email as a single plain-text document and ignore its multi-field structure. This paper investigates that structure and proposes a multi-field learning (MFL) approach to email spam filtering. The MFL approach divides the complex TC problem for a multi-field document into several sub-problems and solves each one separately. During online learning, each scorer is trained within its own text field according to online supervised feedback. During online prediction, the scorers' output scores are combined to predict the category of a new document. The MFL framework is general: it can combine scorers implemented with any statistical TC algorithm. However, previous TC algorithms often require long training or updating times, which is impractical for large-scale email systems. To reduce the space and time costs of email spam filtering, the paper proposes a string-frequency index (SFI) binary TC algorithm, which is based on straightforward conditional probability and has low space and time complexity for both online learning and online prediction. Experimental results on the TREC spam track show that the MFL approach improves the performance of online Bayesian and relaxed online SVM algorithms. In particular, within the MFL framework the proposed SFI algorithm achieves state-of-the-art performance at greatly reduced computational cost.
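The learn-per-field, combine-at-prediction loop described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the SFI formula or the score-combination rule, so everything here is an assumption made for illustration: the class names `FieldScorer` and `MultiFieldFilter`, whitespace tokenization, a Laplace-smoothed estimate of P(spam | token) from token frequency counts, and a plain average across fields. It is not the authors' algorithm.

```python
from collections import defaultdict


class FieldScorer:
    """Hypothetical per-field scorer (assumption, not the paper's SFI):
    keeps per-class document frequencies of tokens for one text field."""

    def __init__(self):
        self.spam = defaultdict(int)  # token -> number of spam emails containing it
        self.ham = defaultdict(int)   # token -> number of ham emails containing it

    def learn(self, text, is_spam):
        # Online update: one counter increment per distinct token.
        counts = self.spam if is_spam else self.ham
        for token in set(text.lower().split()):
            counts[token] += 1

    def score(self, text):
        # Mean Laplace-smoothed estimate of P(spam | token); 0.5 means "no evidence".
        tokens = set(text.lower().split())
        if not tokens:
            return 0.5
        probs = []
        for t in tokens:
            s, h = self.spam.get(t, 0), self.ham.get(t, 0)
            probs.append((s + 1) / (s + h + 2))
        return sum(probs) / len(probs)


class MultiFieldFilter:
    """One scorer per field; field scores are averaged (an assumed combination rule)."""

    def __init__(self, fields):
        self.scorers = {name: FieldScorer() for name in fields}

    def predict(self, email):
        # Online prediction: combine the per-field scores into one spam score.
        scores = [s.score(email.get(name, "")) for name, s in self.scorers.items()]
        return sum(scores) / len(scores)

    def learn(self, email, is_spam):
        # Online learning: each scorer is updated only with its own field's text.
        for name, s in self.scorers.items():
            s.learn(email.get(name, ""), is_spam)


# Usage: predict first, then learn from the supervised feedback.
spam_filter = MultiFieldFilter(["subject", "body"])
message = {"subject": "cheap pills", "body": "buy now"}
score_before = spam_filter.predict(message)
spam_filter.learn(message, is_spam=True)
score_after = spam_filter.predict(message)
```

In this sketch both learning and prediction cost time proportional to the number of distinct tokens in the message, which is the kind of low per-message cost the abstract attributes to a frequency-index approach. Any other scorer exposing `learn` and `score` could be substituted per field.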
