Applying effective feature selection techniques with hierarchical mixtures of experts for spam classification

E-mail abuse has increased steadily over the last decade. E-mail users are targeted by massive quantities of unsolicited bulk e-mail, which often contains offensive language or has fraudulent intent. Internet Service Providers (ISPs), on the other hand, face considerable system overload, as the incoming mail consumes network and storage resources. Among the plethora of solutions, the most prominent in terms of cost efficiency and complexity are the text filtering approaches. Most of these approaches model the problem using linear statistical models. Despite their popularity, which stems from both their simplicity and their relative ease of interpretation, the linearity assumption they impose on the data samples is inappropriate in practice: linear models cannot capture the apparent non-linear relationships that characterize these samples. In this paper, we propose a margin-based feature selection approach integrated with a Hierarchical Mixtures of Experts (HME) system, which attempts to overcome limitations common to other machine-learning-based approaches. After reducing the data dimensionality with effective feature selection algorithms, we evaluated our system on publicly available corpora of e-mails characterized by very high similarity between legitimate and bulk e-mail (and thus low discriminative potential). We experimented with two different architectures, a hierarchical HME and a perceptron HME. Our results confirm that our Spam Filtering (SF)-HME method outperforms other machine learning approaches, which exhibit lower recall, as well as traditional rule-based approaches, which fall considerably short in precision.
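
To make the idea of margin-based feature selection concrete, the following is a minimal sketch of a Relief-style weighting scheme, one well-known member of that family. It is not the algorithm used in this paper, whose details the abstract does not give. The function name `relief_weights`, the toy data, and the use of squared Euclidean distance are all assumptions made for illustration. A feature gains weight when it differs between a sample and its nearest neighbour of the other class (the "miss") and loses weight when it differs from its nearest neighbour of the same class (the "hit").

```python
def relief_weights(X, y):
    """Relief-style feature weights (illustrative sketch, not the paper's method).

    X is a list of numeric feature vectors, y the matching class labels.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d

    def dist(a, b):
        # Squared Euclidean distance; an assumption, any metric could be used.
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no margin can be computed for this sample
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        m = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(d):
            # Reward separation from the miss, penalise separation from the hit.
            w[f] += abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])
    return [wf / n for wf in w]


# Toy data: feature 0 determines the class, feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
```

On this toy data the informative feature receives a higher weight than the noisy one. In a filtering pipeline one would keep only the top-ranked features before training the classifier.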
