Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering

Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.