Active learning with simplified SVMs for spam categorization

We propose a method for spam categorization based on support vector machines (SVMs) that uses an active learning strategy. We study the use of SVMs in classifying e-mail as spam or nonspam. Standard algorithms for training SVMs, however, generally produce solutions with more support vectors than strictly necessary. We apply an algorithm that recognizes and eliminates the unnecessary support vectors. We analyze the particular properties of this task and identify why SVMs, and simplified SVMs in particular, are appropriate for dealing with spam. Instead of using a randomly selected training set, the learner has access to a pool of unlabeled instances and can request the labels for some of them. We introduce a new method for choosing which instances to request next.
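The pool-based setting described above can be illustrated with a minimal sketch. The abstract does not specify the selection criterion, so the code below substitutes a common stand-in, margin-based uncertainty sampling (query the unlabeled instance closest to the current hyperplane); it should not be read as the method introduced in the paper. The linear SVM is fitted by plain stochastic subgradient descent on the hinge loss, and the support-vector simplification step is not shown. All names (`train_linear_svm`, `active_learn`), the hyperparameters, and the two-dimensional toy features are assumptions made for illustration.

```python
import random

def decision(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100, seed=0):
    """Linear soft-margin SVM fitted by stochastic subgradient descent (hinge loss)."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = y[i] * decision(w, b, X[i]) < 1
            # Shrink w (L2 regularization), then step along the hinge subgradient.
            w = [wj * (1 - eta * lam) for wj in w]
            if violated:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b

def active_learn(pool, oracle, seed_idx, budget):
    """Pool-based loop: retrain, request the label of the least certain instance, repeat."""
    labeled = {i: oracle(i) for i in seed_idx}
    queried = []
    for _ in range(budget):
        idx = sorted(labeled)
        w, b = train_linear_svm([pool[i] for i in idx], [labeled[i] for i in idx])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Assumed criterion: smallest |score| = closest to the separating hyperplane.
        pick = min(candidates, key=lambda i: abs(decision(w, b, pool[i])))
        labeled[pick] = oracle(pick)
        queried.append(pick)
    idx = sorted(labeled)
    model = train_linear_svm([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, queried

# Hypothetical toy pool: two made-up features per message, +1 = spam, -1 = nonspam.
rng = random.Random(0)
pool, truth = [], []
for k in range(40):
    label = 1 if k % 2 == 0 else -1
    center = 0.8 if label == 1 else 0.2
    pool.append([rng.gauss(center, 0.1), rng.gauss(center, 0.1)])
    truth.append(label)

seed_idx = [0, 1]  # one labeled example of each class to start from
model, queried = active_learn(pool, lambda i: truth[i], seed_idx, budget=10)
print("queried pool indices:", queried)
```

Here the oracle is simply a lookup into the known labels; in a real spam filter it would be a human annotator, which is why the number of label requests (`budget`) is the quantity an active learner tries to keep small.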