Do Word Clues Suffice in Detecting Spam and Phishing?

Some commercial antispam and anti-phishing products block email from "blacklisted" sites that they claim send spam and phishing email, while allowing email that claims to come from "whitelisted" sites they claim are known not to send it. This approach tends to discriminate unfairly against smaller and lesser-known sites, and would seem to be anti-competitive. An open question is whether other clues to spam and phishing would suffice to identify them. We report on experiments conducted to compare different clues for automated detection tools. Results show that word clues were by far the best clues for spam and phishing, although slightly better performance could be obtained by supplementing word clues with a few others, such as the time of day the email was sent and inconsistencies in the headers. We also compared different approaches to combining clues to spam, such as Bayesian reasoning, case-based reasoning, and neural networks; Bayesian reasoning performed best. Our conclusion is that Bayesian reasoning on word clues is sufficient for antispam software and that blacklists and whitelists are unnecessary.
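
To make "Bayesian reasoning on word clues" concrete, the following is a minimal sketch of one common way such clues can be combined: a naive Bayes score over word presence with add-one smoothing. The abstract does not describe the authors' actual formulation, so the smoothing, the tokenizer, the function names (tokenize, train, spam_log_odds), and the four toy messages are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Toy training set, invented for illustration only (not data from the paper).
TOY = [
    ("win free money now", True),
    ("verify your account password now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]

def tokenize(text):
    """Return the set of lowercase words in a message (presence, not counts)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def train(messages):
    """Count, per class, how many messages contain each word."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        totals[is_spam] += 1
        counts[is_spam].update(tokenize(text))
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Naive Bayes log-odds that a message is spam; positive means spam-like."""
    # Smoothed prior odds of spam versus non-spam.
    score = math.log((totals[True] + 1) / (totals[False] + 1))
    for word in tokenize(text):
        # Add-one smoothing (an assumption) keeps unseen words from zeroing out.
        p_spam = (counts[True][word] + 1) / (totals[True] + 2)
        p_ham = (counts[False][word] + 1) / (totals[False] + 2)
        score += math.log(p_spam / p_ham)
    return score

counts, totals = train(TOY)
print(spam_log_odds("free money", counts, totals))
print(spam_log_odds("monday meeting", counts, totals))
```

In this sketch a word that appears equally often in both classes, or in neither, contributes nothing to the score, so only discriminating words move the decision. Additional clues of the kind the abstract mentions, such as time of day or header inconsistency, could in principle be added as further terms in the same sum.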
