Paper Information - Two Phase Approach for Spam-Mail Filtering

Two Phase Approach for Spam-Mail Filtering

This paper describes a two-phase method for filtering spam mail based on textual information and hyperlinks. Because the body of a spam mail often contains little text, it provides too few hints to distinguish spam from legitimate mail. To resolve this problem, we follow the hyperlinks contained in the email body, fetch the contents of the remote webpages, and extract hints (i.e., features) from both the original email body and the fetched webpages. We divide the hints into two kinds of information: definite information and less definite textual information. In our experiment, the method that fetches webpages improved the F-measure by 9.4% over the method that uses only the original email header and body.
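The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say what counts as "definite information" or which classifier handles the textual hints, so the sender lists in `phase_one`, the keyword count in `phase_two`, and every function and parameter name here are hypothetical placeholders. The `fetch` argument is injected so the example runs without network access.

```python
import re
from typing import Callable, List, Optional, Set

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def extract_links(body: str) -> List[str]:
    """Return the hyperlinks in an email body, in order, without duplicates."""
    links: List[str] = []
    for url in URL_PATTERN.findall(body):
        url = url.rstrip(".,;)")  # drop trailing sentence punctuation
        if url not in links:
            links.append(url)
    return links


def collect_tokens(body: str, fetch: Callable[[str], str]) -> List[str]:
    """Tokenize the email body together with the text of every linked page."""
    parts = [body]
    for url in extract_links(body):
        try:
            parts.append(fetch(url))
        except Exception:
            continue  # an unreachable page simply contributes no extra hints
    return TOKEN_PATTERN.findall(" ".join(parts).lower())


def phase_one(sender: str, blocked: Set[str], trusted: Set[str]) -> Optional[str]:
    """Assumed 'definite information' check: decide only on an exact list match."""
    if sender in blocked:
        return "spam"
    if sender in trusted:
        return "legitimate"
    return None  # undecided, fall through to the textual phase


def phase_two(tokens: List[str], spam_words: Set[str], threshold: int) -> str:
    """Placeholder textual phase: count tokens that appear in a spam word list."""
    score = sum(1 for token in tokens if token in spam_words)
    return "spam" if score >= threshold else "legitimate"


def classify(sender: str, body: str, fetch: Callable[[str], str],
             blocked: Set[str], trusted: Set[str],
             spam_words: Set[str], threshold: int = 2) -> str:
    """Run phase one first, and phase two only when phase one is undecided."""
    verdict = phase_one(sender, blocked, trusted)
    if verdict is not None:
        return verdict
    return phase_two(collect_tokens(body, fetch), spam_words, threshold)
```

With a body such as `"Click http://example.test/offer now"`, the email text alone contains no spam words, so the textual phase has nothing to work with until the linked page is fetched. That is the situation the abstract describes. In practice `fetch` might wrap `urllib.request.urlopen`, and the keyword count would be replaced by a trained classifier.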

Sae-Bom Lee | Jong-Wan Kim | Sin-Jae Kang | In-Gil Nam
