A new semantic-based feature selection method for spam filtering

Abstract The Internet has emerged as a powerful infrastructure for worldwide communication and interaction. Some unethical uses of this technology (for instance, spam or viruses) have created challenges in developing mechanisms that guarantee an affordable and secure user experience. This study deals with the massive delivery of unwanted content or advertising campaigns without the consent of the targeted users (also known as spam). Currently, words (tokens) are selected using feature selection schemes and then used to create feature vectors for training different Machine Learning (ML) approaches. This study introduces a new feature selection method that takes advantage of a semantic ontology to group words into topics and uses those topics to build feature vectors. To evaluate it, we compared the performance of nine well-known ML approaches in conjunction with (i) Information Gain, the most popular feature selection method in the spam-filtering domain; (ii) Latent Dirichlet Allocation, a generative statistical model that allows sets of observations to be explained by unobserved groups that describe why some parts of the data are similar; and (iii) our semantic-based feature selection proposal. The results show the suitability and additional benefits of topic-driven methods for developing and deploying high-performance spam filters.
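
To make the contrast between token-level and topic-level features concrete, the following minimal sketch scores binary features with Information Gain, first on raw tokens and then on topics. It is an illustration under stated assumptions, not the method evaluated in the study: the four toy messages, their labels, and the `word_to_topic` dictionary (a hand-written stand-in for a semantic ontology lookup) are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, feature):
    """Information Gain of a binary feature (present/absent in each document)."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def to_topics(doc, word_to_topic):
    """Replace each known word by its topic; unknown words are dropped."""
    return {word_to_topic[w] for w in doc if w in word_to_topic}


# Hypothetical toy corpus: each message is a set of tokens.
docs = [
    {"cheap", "pills"},
    {"cheap", "offer"},
    {"meeting", "agenda"},
    {"lunch", "schedule"},
]
labels = ["spam", "spam", "ham", "ham"]

# Hypothetical word-to-topic mapping standing in for an ontology.
word_to_topic = {
    "cheap": "commerce", "offer": "commerce", "pills": "medicine",
    "meeting": "work", "agenda": "work", "schedule": "work", "lunch": "food",
}

topic_docs = [to_topics(d, word_to_topic) for d in docs]

token_scores = {t: information_gain(docs, labels, t)
                for t in sorted(set().union(*docs))}
topic_scores = {t: information_gain(topic_docs, labels, t)
                for t in sorted(set().union(*topic_docs))}

print(token_scores)
print(topic_scores)
```

In this toy corpus the tokens "agenda" and "schedule" each appear in only one legitimate message, so neither separates the classes on its own, whereas the shared topic "work" covers both messages. That is the intuition behind grouping words into topics before building feature vectors; whether it helps on real corpora is what the study's comparison addresses.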
