Online supervised learning from multi-field documents for email spam filtering

Email spam filtering can be treated as an online supervised learning task for binary text classification (TC). Previous statistical TC algorithms typically treat an email as a single plain-text document, ignoring the multi-field structure of email documents. This paper investigates that multi-field structure and proposes a multi-field learning (MFL) approach for email spam filtering. The MFL approach divides the complex TC problem of a multi-field document into several sub-problems and conquers each sub-problem separately. During online learning, each of the multiple scorers is trained separately on its own text field according to online supervised feedback. During online prediction, the scorers' output scores are combined to predict the category of a new document. The MFL framework is general and can combine scorers implemented by any statistical TC algorithm. However, previous TC algorithms often require long training or updating times, which makes them impractical for large-scale email systems. To reduce the space and time cost of email spam filtering, a string-frequency index (SFI) binary TC algorithm is proposed, which is based on straightforward conditional probability and has low space-time complexity for both online learning and online prediction. Experimental results on the TREC spam track show that the MFL approach improves the performance of online Bayesian and relaxed online SVM algorithms. In particular, the proposed SFI algorithm achieves state-of-the-art performance at greatly reduced computational cost within the MFL framework.
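To make the divide-and-combine idea concrete, the following is a minimal sketch of one possible multi-field filter, not the implementation evaluated in this paper. All names (FieldScorer, MultiFieldFilter), the field names, the whitespace tokenization, the add-one smoothing, and the combination by simple averaging are assumptions chosen for illustration; the per-field scorer is only a frequency-count stand-in inspired by the conditional-probability idea behind SFI.

```python
class FieldScorer:
    """Hypothetical per-field scorer based on token frequency counts."""

    def __init__(self):
        self.spam = {}  # token -> number of spam messages containing it
        self.ham = {}   # token -> number of ham messages containing it

    def score(self, text):
        # Mean smoothed estimate of P(spam | token) over the field's tokens.
        tokens = text.lower().split()  # assumption: whitespace tokenization
        if not tokens:
            return 0.5  # no evidence in this field
        probs = []
        for t in tokens:
            s, h = self.spam.get(t, 0), self.ham.get(t, 0)
            probs.append((s + 1) / (s + h + 2))  # assumption: add-one smoothing
        return sum(probs) / len(probs)

    def update(self, text, is_spam):
        # Online learning step: one count update per distinct token.
        counts = self.spam if is_spam else self.ham
        for t in set(text.lower().split()):
            counts[t] = counts.get(t, 0) + 1


class MultiFieldFilter:
    """Hypothetical combiner: one scorer per field, scores averaged."""

    def __init__(self, fields):
        self.scorers = {f: FieldScorer() for f in fields}

    def predict(self, email):
        # Online prediction: combine the per-field scores.
        scores = [s.score(email.get(f, "")) for f, s in self.scorers.items()]
        return sum(scores) / len(scores)  # assumption: unweighted average

    def learn(self, email, is_spam):
        # Online feedback: each scorer is updated only with its own field.
        for f, s in self.scorers.items():
            s.update(email.get(f, ""), is_spam)
```

In this sketch an email is a dictionary mapping field names to text, a combined score above 0.5 is read as spam, and learn is called whenever labeled feedback arrives. Any other statistical TC algorithm could replace FieldScorer as long as it exposes the same score and update operations.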