Generating Estimates of Classification Confidence for a Case-Based Spam Filter

Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that produce numeric classification scores (e.g. k-Nearest Neighbour, Naive Bayes or Support Vector Machines) could readily yield confidence estimates by thresholding those scores. This proves not to be the case, probably because these are not probabilistic classifiers in the strict sense: their numeric scores are not well correlated with classification confidence. In this paper we describe a case-based spam filtering application that would benefit significantly from the ability to attach confidence predictions to positive classifications (i.e. messages classified as spam). We show that ‘obvious’ confidence metrics for a case-based classifier are not effective. We propose an ensemble-like solution that aggregates a collection of confidence metrics, and show that it is effective in this spam filtering domain.
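To make the idea of aggregating several confidence metrics concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a k-Nearest Neighbour classifier over sets of tokens with Jaccard similarity, three illustrative measures (`vote_fraction`, `similarity_share`, `like_unlike_margin`), hand-picked thresholds, and a unanimous-agreement rule; all of these names, values and the toy case base are hypothetical. Following the abstract, confidence is only ever attached to spam predictions.

```python
from typing import FrozenSet, List, Sequence, Tuple

Case = Tuple[FrozenSet[str], str]   # (token set, label: "spam" or "ham")
Neighbour = Tuple[float, str]       # (similarity to the query, label)


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest(case_base: Sequence[Case], query: FrozenSet[str], k: int) -> List[Neighbour]:
    """Return the k most similar cases as (similarity, label) pairs."""
    scored = [(jaccard(query, feats), label) for feats, label in case_base]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:k]


def vote_fraction(neighbours: Sequence[Neighbour], predicted: str) -> float:
    """Fraction of neighbours that share the predicted label."""
    return sum(1 for _, lab in neighbours if lab == predicted) / len(neighbours)


def similarity_share(neighbours: Sequence[Neighbour], predicted: str) -> float:
    """Share of total neighbour similarity contributed by like-labelled cases."""
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0.0
    return sum(sim for sim, lab in neighbours if lab == predicted) / total


def like_unlike_margin(neighbours: Sequence[Neighbour], predicted: str) -> float:
    """Best like-labelled similarity minus best unlike-labelled similarity."""
    like = max((sim for sim, lab in neighbours if lab == predicted), default=0.0)
    unlike = max((sim for sim, lab in neighbours if lab != predicted), default=0.0)
    return like - unlike


# Hypothetical (measure, threshold) pairs; real thresholds would need tuning.
MEASURES = [(vote_fraction, 1.0), (similarity_share, 0.9), (like_unlike_margin, 0.3)]


def classify_with_confidence(case_base: Sequence[Case], query: FrozenSet[str],
                             k: int = 3) -> Tuple[str, bool]:
    """Majority-vote k-NN label plus a flag set only when the prediction is
    spam and every measure clears its threshold (unanimous aggregation)."""
    neighbours = nearest(case_base, query, k)
    spam_votes = sum(1 for _, lab in neighbours if lab == "spam")
    predicted = "spam" if spam_votes > len(neighbours) / 2 else "ham"
    confident = predicted == "spam" and all(
        measure(neighbours, predicted) >= threshold
        for measure, threshold in MEASURES
    )
    return predicted, confident


# Toy case base, invented purely for illustration.
EXAMPLE_CASES: List[Case] = [
    (frozenset({"free", "viagra", "offer"}), "spam"),
    (frozenset({"free", "offer", "win"}), "spam"),
    (frozenset({"viagra", "offer", "cheap"}), "spam"),
    (frozenset({"meeting", "agenda", "monday"}), "ham"),
    (frozenset({"project", "report", "offer"}), "ham"),
]
```

Requiring every measure to agree trades coverage for precision: fewer spam predictions are flagged as confident, but a message is only flagged when no individual measure raises doubt. That trade-off suits a setting where a confidently wrong spam label is the costly error.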
