Evaluating cost-sensitive Unsolicited Bulk Email categorization

In the recent years, Unsolicited Bulk Email has became an increasingly important problem, with a big economic impact. In this paper, we discuss cost-sensitive Text Categorization methods for UBE filtering. In concrete, we have evaluated a range of Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector Machines and Rocchio), made cost sensitive through several methods (Threshold Optimization, Instance Weighting, and Meta-Cost). We have used the Receiver Operating Characteristic Convex Hull method for the evaluation, that best suits classification problems in which target conditions are not known, as it is the case. Our results do not show a dominant algorithm nor method for making algorithms cost-sensitive, but are the best reported on the test collection used, and approach real-world hand-crafted classifiers accuracy.

[1]  R. Hall A Countermeasure to Duplicate-detecting Anti-spam Techniques , 1999 .

[2]  Constantine D. Spyropoulos,et al.  An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages , 2000, SIGIR '00.

[3]  Wai Lam,et al.  Automatic Text Categorization and Its Application to Text Retrieval , 1999, IEEE Trans. Knowl. Data Eng..

[4]  Aiko M. Hormann,et al.  Programs for Machine Learning. Part I , 1962, Inf. Control..

[5]  Jefferson Provost,et al.  Na ive-Bayes vs. Rule-Learning in Classification of Email , 1999 .

[6]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[7]  Robert C. Holte,et al.  Explicitly representing expected cost: an alternative to ROC representation , 2000, KDD '00.

[8]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[9]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization, advances in kernel methods , 1999 .

[10]  Tom Fawcett,et al.  Robust Classification for Imprecise Environments , 2000, Machine Learning.

[11]  Georgios Paliouras,et al.  Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach , 2000, ArXiv.

[12]  Susan T. Dumais,et al.  A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[13]  Thorsten Joachims,et al.  A statistical learning learning model of text classification for support vector machines , 2001, SIGIR '01.

[14]  Ian H. Witten,et al.  Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[15]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .

[16]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[17]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[18]  Georgios Paliouras,et al.  Stacking Classifiers for Anti-Spam Filtering of E-Mail , 2001, EMNLP.

[19]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[20]  Serge Gauthronet,et al.  Unsolicited commercial communications and data protection , 2001 .

[21]  Patrick Pantel,et al.  SpamCop: A Spam Classification & Organisation Program , 1998, AAAI 1998.

[22]  Kai Ming Ting,et al.  Inducing Cost-Sensitive Trees via Instance Weighting , 1998, PKDD.

[23]  Georgios Paliouras,et al.  An evaluation of Naive Bayesian anti-spam filtering , 2000, ArXiv.

[24]  Ian H. Witten,et al.  Generating Accurate Rule Sets Without Global Optimization , 1998, ICML.

[25]  José María Gómez Hidalgo,et al.  Combining Text and Heuristics for Cost-Sensitive Spam Filtering , 2000, CoNLL/LLL.

[26]  Lluís Màrquez i Villodre,et al.  Boosting Trees for Anti-Spam Email Filtering , 2001, ArXiv.

[27]  Tom Fawcett,et al.  Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions , 1997, KDD.

[28]  Yiming Yang,et al.  An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[29]  Yossi Matias,et al.  Consistent, yet anonymous, Web access with LPWA , 1999, CACM.

[30]  Ron Kohavi,et al.  The Case against Accuracy Estimation for Comparing Induction Algorithms , 1998, ICML.