The application of decision tree in Chinese email classification

Email is a semi-structured document whose structure carries important attributes, and exploiting spam-specific features in particular can improve classification results. In this paper, we apply the decision tree data mining technique to discover the latent association rules among these email attributes, and then identify the category of an unknown email based on these rules. In experiments that apply our classifier to a large number of Chinese emails, our method performs no worse than existing methods that examine the full text of the email content. At the same time, our method reduces computational cost and the consumption of system resources.
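To illustrate the general idea of learning rules from structural attributes rather than from body text, the following is a minimal sketch, not the authors' implementation. It assumes three hypothetical boolean attributes (`has_reply_to`, `html_only`, `many_recipients`) and a tiny invented training set, since the abstract does not list the actual features or data. It also uses plain information gain (ID3-style) to choose splits, which is a simplification; the abstract does not state which decision tree algorithm or split criterion was used.

```python
import math
from collections import Counter

# Hypothetical structural attributes and toy labels, invented for illustration;
# the abstract does not specify the features or the corpus.
EMAILS = [
    ({"has_reply_to": False, "html_only": True,  "many_recipients": True},  "spam"),
    ({"has_reply_to": False, "html_only": True,  "many_recipients": False}, "spam"),
    ({"has_reply_to": True,  "html_only": False, "many_recipients": False}, "ham"),
    ({"has_reply_to": True,  "html_only": True,  "many_recipients": False}, "ham"),
    ({"has_reply_to": False, "html_only": False, "many_recipients": True},  "spam"),
    ({"has_reply_to": True,  "html_only": False, "many_recipients": True},  "ham"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on attr."""
    labels = [label for _, label in rows]
    remainder = 0.0
    for value in {feats[attr] for feats, _ in rows}:
        subset = [label for feats, label in rows if feats[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, attrs):
    """Recursively build a tree: a leaf is a label, a node is (attr, branches)."""
    labels = [label for _, label in rows]
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a))
    remaining = [a for a in attrs if a != best]
    branches = {}
    for value in {feats[best] for feats, _ in rows}:
        subset = [(f, l) for f, l in rows if f[best] == value]
        branches[value] = build_tree(subset, remaining)
    return (best, branches)

def classify(tree, feats, default="ham"):
    """Walk the tree using only the email's structural attributes."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches.get(feats[attr], default)
    return tree

TREE = build_tree(EMAILS, ["has_reply_to", "html_only", "many_recipients"])
```

Each root-to-leaf path in `TREE` reads as a rule over header-level attributes, so classifying a new message with `classify(TREE, feats)` needs only those attributes and never touches the body text. That is the source of the reduced computation the abstract describes, under the assumptions stated above.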
