A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

Typically, spam filters are built on the assumption that the characteristics of e-mails in the training dataset are identical to those of e-mails in the individual users’ inboxes to which the filter will be applied. This assumption is often incorrect, leading to poor filter performance. A personalized spam filter instead takes into account the characteristics of e-mails in each individual user’s inbox. We present an automatic approach for personalized spam filtering that does not require users’ feedback. The proposed algorithm builds a statistical model of spam and non-spam words from the labeled training dataset and then updates it in two passes over the individual user’s unlabeled inbox. The personalization of the model leads to improved filtering performance. We perform extensive experimentation, report results using several performance measures, and discuss tuning the two parameters of the algorithm.
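The general idea of building a word-level model from labeled data and then adapting it to an unlabeled inbox can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the statistical model, the update rule, or the two tunable parameters, so the scoring rule, the pseudo-labeling step, and all function names below (`build_model`, `score`, `classify`, `personalize`) are assumptions chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_model(emails, labels):
    """Count in how many spam / non-spam e-mails each word occurs."""
    spam, ham = Counter(), Counter()
    n_spam = sum(labels)
    n_ham = len(labels) - n_spam
    for text, is_spam in zip(emails, labels):
        (spam if is_spam else ham).update(set(tokenize(text)))
    return spam, ham, n_spam, n_ham


def score(text, model):
    """Number of spam-leaning words minus number of non-spam-leaning words."""
    spam, ham, n_spam, n_ham = model
    total = 0
    for word in set(tokenize(text)):
        p_spam = spam[word] / max(n_spam, 1)
        p_ham = ham[word] / max(n_ham, 1)
        total += (p_spam > p_ham) - (p_spam < p_ham)
    return total


def classify(emails, model):
    # An e-mail is called spam only if its score is strictly positive.
    return [score(text, model) > 0 for text in emails]


def personalize(train_emails, train_labels, inbox):
    # Global model from the labeled training dataset.
    model = build_model(train_emails, train_labels)
    # Pass 1 (assumed form): pseudo-label the unlabeled inbox.
    pass1 = classify(inbox, model)
    # Update (assumed form): rebuild the word statistics from the inbox itself.
    user_model = build_model(inbox, pass1)
    # Pass 2 (assumed form): relabel the inbox with the personalized model.
    return classify(inbox, user_model)
```

In this toy setting, words that appear only in the user's inbox contribute nothing under the global model but can become informative after the first pass, which is the kind of effect personalization is meant to capture. No user feedback is involved at any point, since the inbox labels come from the model itself.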