Combining neural networks and semantic feature space for email classification

Email is one of the most ubiquitous and pervasive applications, used daily by millions of people worldwide, and individuals and organizations increasingly rely on it to communicate and share information and knowledge. However, the growth in email users has been accompanied by a dramatic increase in spam over the past few years, and processing and managing email efficiently has become a major challenge for both individuals and organizations. This paper proposes new email classification models using a linear neural network trained by the perceptron learning algorithm and a nonlinear neural network trained by the back-propagation learning algorithm. An efficient semantic feature space (SFS) method is introduced into these classification models. The traditional back-propagation neural network (BPNN) learns slowly and is prone to becoming trapped in local minima, so a modified back-propagation neural network (MBPNN) is presented to overcome these limitations. Email classification systems based on the vector space model suffer from a large number of features and from ambiguity in the meaning of terms, which leads to a sparse and noisy feature space. We therefore use the SFS to convert the original sparse and noisy feature space into a semantically richer one, which helps to accelerate learning. Experiments are conducted with different training set sizes and extracted feature sizes. The results show that the models using MBPNN outperform those using the traditional BPNN, and that the SFS greatly reduces feature dimensionality and improves email classification performance.
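
The pipeline described above (project a sparse term space into a smaller semantic space, then train a linear network with the perceptron rule) can be illustrated with a minimal sketch. The abstract does not specify how the SFS is built; the sketch assumes it resembles latent semantic analysis, i.e. a truncated singular value decomposition of the document-term matrix. The toy vocabulary, the count matrix, the value `k = 2`, and all function names are hypothetical and chosen only for illustration; the MBPNN is not shown.

```python
import numpy as np

def fit_semantic_space(X, k):
    """Return a (terms x k) projection matrix from a truncated SVD.

    Assumption: the SFS is approximated here by LSA-style truncated SVD.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def train_perceptron(Z, y, epochs=50, lr=1.0):
    """Classic perceptron rule on features Z with labels y in {+1, -1}."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for z, t in zip(Z, y):
            pred = 1 if z @ w + b > 0 else -1
            if pred != t:
                # Update only on misclassified examples.
                w += lr * t * z
                b += lr * t
                errors += 1
        if errors == 0:
            break  # every training example is classified correctly
    return w, b

def predict(Z, w, b):
    return np.where(Z @ w + b > 0, 1, -1)

# Hypothetical term counts; columns: free, win, prize, meeting, report, schedule
X_train = np.array([
    [2, 1, 1, 0, 0, 0],   # spam
    [1, 2, 0, 0, 0, 0],   # spam
    [1, 0, 2, 0, 0, 0],   # spam
    [0, 0, 0, 3, 1, 1],   # legitimate
    [0, 0, 0, 1, 2, 0],   # legitimate
    [0, 0, 0, 1, 0, 2],   # legitimate
], dtype=float)
y_train = np.array([1, 1, 1, -1, -1, -1])   # +1 = spam, -1 = legitimate

P = fit_semantic_space(X_train, k=2)        # 6 terms -> 2 semantic features
Z_train = X_train @ P
w, b = train_perceptron(Z_train, y_train)

# Unseen messages are projected with the same matrix P before classification.
X_new = np.array([[1, 1, 1, 0, 0, 0],
                  [0, 0, 0, 1, 1, 1]], dtype=float)
pred_new = predict(X_new @ P, w, b)
print(pred_new)
```

The point of the sketch is the shape change: the classifier sees two dense features per message instead of six sparse term counts, which is the kind of dimensionality reduction the abstract attributes to the SFS.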
