Using Biased Discriminant Analysis for Email Filtering

This paper reports on email filtering based on content features. We test the validity of a novel statistical feature extraction method that relies on dimensionality reduction to retain the most informative and discriminative features from messages. The approach, named Biased Discriminant Analysis (BDA), seeks a feature space transformation that clusters the positive examples closely while pushing the negative ones away. The method extends Linear Discriminant Analysis (LDA) but introduces a different transformation to improve the separation between classes, and it has not previously been applied to text mining tasks. We successfully test BDA under two schemas. The first is a traditional classification scenario using 10-fold cross-validation on four standard ground-truth corpora: LingSpam, SpamAssassin, the Phishing corpus, and a subset of the TREC 2007 spam corpus. In the second schema we test the anticipatory properties of the statistical features on the TREC 2007 spam corpus. The contribution of this work is the evidence that BDA offers better discriminative features for email filtering, gives stable classification results regardless of the number of features chosen, and yields features that robustly retain their discriminative value over time.
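To make the idea of clustering positives while pushing negatives away concrete, the following is a minimal sketch of one common formulation of BDA from the literature. It is an assumption that this matches the variant evaluated here, since this section does not give the equations. The sketch measures both scatter matrices around the mean of the positive class and maximizes the ratio of negative scatter to regularized positive scatter. The function name `bda_transform` and the parameters `X_pos`, `X_neg`, `n_components`, and `reg` are hypothetical names chosen for illustration.

```python
import numpy as np


def bda_transform(X_pos, X_neg, n_components=2, reg=1e-3):
    """Sketch of a BDA-style projection (assumed formulation, not the paper's code).

    X_pos: (n_pos, d) feature vectors of positive examples.
    X_neg: (n_neg, d) feature vectors of negative examples.
    Returns W of shape (d, n_components).
    """
    # Both scatters are taken around the positive-class mean: this is the "bias".
    m_pos = X_pos.mean(axis=0)
    P = X_pos - m_pos
    N = X_neg - m_pos
    S_pos = P.T @ P / len(X_pos)  # spread of positives around their own mean
    S_neg = N.T @ N / len(X_neg)  # spread of negatives around the positive mean

    # Regularize S_pos so it is positive definite (the value of reg is an assumption).
    d = S_pos.shape[0]
    S_pos_reg = S_pos + reg * np.eye(d)

    # Solve S_neg w = lambda * S_pos_reg w by whitening with a Cholesky factor,
    # which turns it into an ordinary symmetric eigenproblem.
    L = np.linalg.cholesky(S_pos_reg)
    L_inv = np.linalg.inv(L)
    M = L_inv @ S_neg @ L_inv.T
    eigvals, eigvecs = np.linalg.eigh(M)

    # Keep the directions with the largest negative-to-positive scatter ratio.
    order = np.argsort(eigvals)[::-1][:n_components]
    return L_inv.T @ eigvecs[:, order]
```

In an email filtering setting, the rows of `X_pos` and `X_neg` would be term-based feature vectors of the two message classes, and a classifier would then be trained on the projected features `X @ W`.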
