Analyzing the Performance of Spam Filtering Methods When Dimensionality of Input Vector Changes

Spam is a complex problem that hinders the use of Internet resources. Several authorities have warned about the scale of the problem and urged everyone to fight it. In this paper we present an extensive analysis of how changing the dimensionality of the message representation affects the accuracy of some well-known classical spam filtering techniques. The conclusions drawn from the experiments will be useful for comparing the effects of dimensionality changes on classical filtering techniques with their effects on a successful spam filter model called SpamHunting.
