Automatic thesaurus construction for spam filtering using revised back propagation neural network

Email has become one of the fastest and most economical forms of communication, and it is among the most pervasive applications, used daily by millions of people worldwide. However, the growth in email users has brought a dramatic increase in spam over the past few years. This paper proposes a new spam filtering system that uses a revised back propagation (RBP) neural network and automatic thesaurus construction. The conventional back propagation (BP) neural network learns slowly and is prone to becoming trapped in a local minimum, which leads to poor performance and efficiency. The authors present the RBP neural network to overcome these limitations. A well-constructed thesaurus has been recognized as a valuable tool for effective text classification; it can also overcome a problem of keyword-based spam filters, which ignore the relationships between words. The authors conduct experiments on the Ling-Spam corpus. Experimental results show that the proposed spam filtering system achieves higher performance, especially when the RBP neural network is combined with automatic thesaurus construction.
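
The abstract does not say how the thesaurus is built, so the following is only a minimal sketch of one common approach, not the authors' method. It assumes a co-occurrence thesaurus: two terms are treated as related when the sets of documents they appear in have a high cosine similarity. The function names `build_thesaurus` and `expand_features`, the `threshold` parameter, and the toy messages are all hypothetical and chosen for illustration. The sketch shows how a thesaurus lets a filter see related words that a pure keyword match would miss.

```python
import math
from collections import defaultdict


def build_thesaurus(docs, threshold=0.5):
    """Map each term to the terms that co-occur with it across documents.

    Assumption: similarity is the cosine between binary term-document
    vectors. The paper's actual construction may differ.
    """
    # term -> set of indices of the documents that contain it
    postings = defaultdict(set)
    for i, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            postings[term].add(i)

    thesaurus = defaultdict(dict)
    terms = sorted(postings)
    for a_idx, a in enumerate(terms):
        for b in terms[a_idx + 1:]:
            shared = len(postings[a] & postings[b])
            if shared == 0:
                continue
            # cosine similarity of two binary vectors
            sim = shared / math.sqrt(len(postings[a]) * len(postings[b]))
            if sim >= threshold:
                thesaurus[a][b] = sim
                thesaurus[b][a] = sim
    return thesaurus


def expand_features(message, thesaurus):
    """Return the message's terms plus their thesaurus neighbours."""
    terms = set(message.lower().split())
    expanded = set(terms)
    for term in terms:
        expanded.update(thesaurus.get(term, {}))
    return expanded


# Toy corpus, invented for this example only.
docs = [
    "cheap pills online",
    "cheap pills discount",
    "meeting agenda tomorrow",
]
thesaurus = build_thesaurus(docs, threshold=0.8)
features = expand_features("cheap offer", thesaurus)
```

In this toy run, a message containing only "cheap" also picks up "pills" as a feature, because the two terms always appear together in the corpus. The expanded feature set could then be encoded as the input vector of a classifier such as a neural network.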
