Towards the integration of diverse spam filtering techniques

Text-based spam filters (e.g., keyword and statistical learning filters) use tokens, which are found during message content analysis, to separate spam from legitimate messages. The effectiveness of these token-based filters is due to the presence of token signatures (i.e., tokens that are invariant across the many variants of a spam message). Unfortunately, it is relatively easy for spammers to hide or erase these signatures through simple techniques such as misspellings (to confuse keyword filters) and camouflage (i.e., spam combined with legitimate content to confuse statistical filters). Our hypothesis is that spam contains additional signatures which are more difficult to hide. A concrete example of this type of signature is the presence of URLs in spam messages, which are used to induce contact from victims. We believe diverse spam filtering tools should be developed to incorporate these additional signatures. Thus, in this paper, we discuss a new type of URL-based filtering which can be integrated with existing spam filtering techniques to provide a more robust anti-spam solution. Our approach uses the syntactic constraints of URLs to find them in emails, and then uses semantic knowledge and tools (e.g., search engines) to refine and sharpen the spam identification process.

In this paper, we focus our attention on spam messages that contain URLs and provide a novel approach for filtering these messages. The key observation is that most spam messages contain URLs which are "live", since the spammers would not be able to profit without a functioning link to their site. Thus, by checking the URLs found in a message and verifying a user's interest in the websites referenced by those URLs, we are able to add a new dimension to spam filtering. This paper has two main contributions.
First, we describe three techniques for filtering email messages that contain URLs: URL category whitelists, URL regular expression whitelists, and dynamic classification of websites. Second, we describe a prototype implementation that takes advantage of these three techniques to help enhance spam filtering. Our preliminary results suggest that new dimensions in spam filtering (e.g., using URLs) deserve further exploration. However, due to space limitations, we have omitted our experimental results from this paper.

The remainder of the paper is structured as follows. Section II gives an overview of the related work done in this research area. In Section III, we describe our approach, and Section IV discusses the details of our system's implementation. We provide our conclusions in Section V.
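
To make the idea of URL-based filtering concrete, the following is a minimal sketch of two of the steps named above: locating URLs through their syntax and checking them against a regular expression whitelist. It is an illustration under stated assumptions, not the prototype described in this paper. The URL pattern is deliberately simplified, and the whitelist entries and the function names (extract_urls, host_is_whitelisted, unapproved_urls) are hypothetical.

```python
import re
from typing import List
from urllib.parse import urlparse

# Simplified URL pattern (an assumption; real URL syntax is broader than this).
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

# Hypothetical whitelist: regular expressions over hostnames the user trusts.
HOST_WHITELIST = [
    re.compile(r"(^|\.)example\.edu$"),
    re.compile(r"(^|\.)example\.org$"),
]


def extract_urls(message: str) -> List[str]:
    """Find URLs in the message text and strip trailing punctuation."""
    return [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(message)]


def host_is_whitelisted(url: str) -> bool:
    """Return True if the URL's hostname matches any whitelist expression."""
    host = (urlparse(url).hostname or "").lower()
    return any(pattern.search(host) for pattern in HOST_WHITELIST)


def unapproved_urls(message: str) -> List[str]:
    """Return the URLs in the message that no whitelist expression covers."""
    return [u for u in extract_urls(message) if not host_is_whitelisted(u)]


if __name__ == "__main__":
    sample = (
        "Course notes: http://www.example.edu/course. "
        "Buy now at http://cheap-pills.example.com/now!"
    )
    print(unapproved_urls(sample))
```

In this sketch, a message whose URLs are all whitelisted would pass this check, while any remaining URLs would be handed to a further step such as category lookup or dynamic classification of the referenced website, which is not shown here.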