An Assessment of Case-Based Reasoning for Spam Filtering

Because spam changes constantly, a spam filtering system that uses machine learning needs to be dynamic. This suggests that a case-based (memory-based) approach may work well. Case-Based Reasoning (CBR) is a lazy approach to machine learning in which induction is delayed until run time. As a result, the case base can be updated continuously, and new training data is immediately available to the induction process. In this paper we present a detailed description of such a system, called ECUE, and evaluate design decisions concerning the case representation. We compare its performance with that of an alternative system that uses Naïve Bayes. In cross-validation tests on data sets, there is little to choose between the two. However, ECUE does appear to have some advantages in tracking concept drift over time.
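
To make the "lazy" property concrete, the following is a minimal sketch of a memory-based classifier in which adding a case is only a storage operation and all induction happens at query time. It is not ECUE's implementation: the `CaseBase` class, the whitespace tokenisation, the Jaccard similarity, and the majority vote over the `k` nearest cases are all assumptions chosen for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Set-overlap similarity between two token sets (an assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class CaseBase:
    """Hypothetical toy case base: stores cases, classifies by k-NN vote."""

    def __init__(self, k=1):
        self.k = k
        self.cases = []  # list of (frozenset of tokens, label) pairs

    def add_case(self, text, label):
        # "Training" is only storage; no model is rebuilt.
        self.cases.append((frozenset(text.lower().split()), label))

    def classify(self, text):
        # Induction is delayed to this point and uses whatever cases exist now.
        if not self.cases:
            raise ValueError("case base is empty")
        query = frozenset(text.lower().split())
        ranked = sorted(
            self.cases, key=lambda case: jaccard(query, case[0]), reverse=True
        )
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


cb = CaseBase(k=1)
cb.add_case("cheap pills buy now", "spam")
cb.add_case("meeting agenda for monday", "ham")

# Nearest stored case shares only "monday" with the query.
before = cb.classify("monday cruise offer")

# A newly reported spam message is added; the next query sees it immediately.
cb.add_case("cruise offer free", "spam")
after = cb.classify("monday cruise offer")
```

In this sketch the same query is labelled differently before and after the new case is stored, with no retraining step in between, which is the behaviour the abstract relies on for tracking concept drift.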
