Paper Information - Combining Winnow and Orthogonal Sparse Bigrams for Incremental Spam Filtering

Spam filtering is a text categorization task that has attracted significant attention because of the rapidly growing volume of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both of these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques, we reach an accuracy of 99.68% on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus.
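To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that OSB pairs each token with each earlier token inside a sliding window and marks the skipped positions, and that Winnow is a mistake-driven linear classifier with multiplicative weight updates. The names `osb_features` and `Winnow`, the window size of 5, the `<skip>` marker, the promotion and demotion factors, the single spam-versus-ham weight vector, and the normalization of the score by the number of active features are all illustrative assumptions that the abstract does not specify.

```python
from typing import Dict, List


def osb_features(tokens: List[str], window: int = 5) -> List[str]:
    """Pair each token with every earlier token in the window.

    Skipped positions between the two tokens are marked with "<skip>",
    so pairs at different distances become different features.
    """
    features = []
    for i, right in enumerate(tokens):
        for dist in range(1, window):
            j = i - dist
            if j < 0:
                break
            skips = ["<skip>"] * (dist - 1)
            features.append(" ".join([tokens[j], *skips, right]))
    return features


class Winnow:
    """Winnow-style binary classifier (illustrative parameters)."""

    def __init__(self, promotion: float = 2.0, demotion: float = 0.5,
                 threshold: float = 1.0) -> None:
        self.promotion = promotion
        self.demotion = demotion
        self.threshold = threshold
        self.weights: Dict[str, float] = {}  # unseen features weigh 1.0

    def score(self, features: List[str]) -> float:
        active = set(features)
        if not active:
            return 0.0
        total = sum(self.weights.get(f, 1.0) for f in active)
        return total / len(active)  # assumption: average over active features

    def predict(self, features: List[str]) -> bool:
        return self.score(features) > self.threshold

    def train(self, features: List[str], is_spam: bool) -> None:
        # Mistake-driven: weights change only when the prediction is wrong.
        if self.predict(features) == is_spam:
            return
        factor = self.promotion if is_spam else self.demotion
        for f in set(features):
            self.weights[f] = self.weights.get(f, 1.0) * factor
```

For example, `osb_features("Do you feel lucky today?".split())` ends with the four pairs that combine `today?` with each of the four preceding tokens. Because training touches weights only after a misclassification, messages can be learned one at a time, which fits the incremental setting in the title.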

William S. Yerazunis | Christian Siefkes | Fidelis Assis | Shalendra Chhabra

[1] William S. Yerazunis. Sparse Binary Polynomial Hashing and the CRM114 Discriminator, 2006.

[2] José María Gómez Hidalgo, et al. Evaluating cost-sensitive Unsolicited Bulk Email categorization, 2002, SAC '02.

[3] Ido Dagan, et al. Mistake-Driven Learning in Text Categorization, 1997, EMNLP.

[4] W. Yerazunis. The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It.

[5] Yoram Singer, et al. Context-sensitive learning methods for text categorization, 1996, SIGIR '96.

[6] Dan Roth, et al. A Learning Approach to Shallow Parsing, 1999, EMNLP.

[7] W. S. Yerazunis. The Spam-Filtering Accuracy Plateau at 99.9 percent Accuracy and How to Get Past It, 2004.

[8] Le Zhang, et al. Filtering Junk Mail with a Maximum Entropy Model, 2003.

[9] Erhard Konrad, et al. A Toolkit for Caching and Prefetching in the Context of Web Application Platforms, 2002.

[10] N. Littlestone. Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm, 1987, 28th Annual Symposium on Foundations of Computer Science (sfcs 1987).