Paper information - COLLABORATIVE SPAM FILTERING WITH THE HASHING TRICK

User feedback is vital to the quality of the collaborative spam filters frequently used in open membership email systems such as Yahoo Mail or Gmail. Users occasionally designate emails as spam or non-spam (often termed ham), and these labels are subsequently used to train the spam filter. Although the majority of users provide very little data, collectively the amount of training data is very large (many millions of emails per day). Unfortunately, there is substantial deviation in users’ notions of what constitutes spam and ham. Additionally, the open membership policy of these systems makes them vulnerable to users with malicious intent – spammers who wish to see their emails accepted by any spam filtration system can create accounts and use these to give malicious feedback to ‘train’ the spam filter to give their emails a free pass. When combined, these realities make it extremely difficult to assemble a single, global spam classifier.

Alexander J. Smola | Kilian Q. Weinberger | Martin Zinkevich | Anirban Dasgupta | Josh Attenberg

[1] Gordon V. Cormack, et al. TREC 2006 Spam Track Overview, 2006, TREC.

[2] Kilian Q. Weinberger, et al. Feature hashing for large scale multitask learning, 2009, ICML '09.

[3] Rich Caruana, et al. Algorithms and Applications for Multitask Learning, 1996, ICML.

[4] Alexander J. Smola, et al. Collaborative Email-Spam Filtering with the Hashing-Trick, 2009.