Spam Detection Based on Feature Evolution to Deal with Concept Drift

Electronic messages are still considered the most significant tools in business and personal applications due to their low cost and easy access. However, e-mail has become a major problem owing to the high volume of junk mail, known as spam, which fills users' mailboxes. Several approaches have been proposed to detect spam, such as filters implemented in e-mail servers and user-based spam message classification mechanisms. A major problem with these approaches is spam detection in the presence of concept drift, especially as a result of changes in features over time. To overcome this problem, this work proposes a new spam detection system based on analyzing the evolution of features. The proposed method is divided into three steps: 1) spam classification model training; 2) concept drift detection; and 3) knowledge transfer learning. The first step generates classification models, as commonly conducted in machine learning. The second step introduces a new strategy to deal with concept drift: SFS (Similarity-based Features Selection), which analyzes the evolution of the features by taking into account the similarity between the feature vectors extracted from training data and test data. Finally, the third step focuses on the following questions: what, how, and when to transfer acquired knowledge? The proposed method is evaluated using two public datasets. The results of the experiments show that it is possible to infer a threshold to detect changes (drift) in order to ensure that the spam classification model is updated through knowledge transfer. Moreover, our anomaly detection system is able to perform spam classification and concept drift detection as two parallel and independent tasks.
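
The drift detection idea described above, comparing feature vectors from training and test data against a threshold, can be illustrated with a minimal sketch. This is not the SFS algorithm itself, whose details the abstract does not give. It assumes bag-of-words token counts as features, cosine similarity as the similarity measure, and an arbitrary threshold of 0.5; the function names `feature_vector`, `cosine_similarity`, and `drift_detected` are hypothetical.

```python
import math
from collections import Counter


def feature_vector(messages):
    """Aggregate whitespace-token counts over a batch of messages.

    Assumption: bag-of-words counts stand in for the paper's features.
    """
    counts = Counter()
    for text in messages:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drift_detected(train_messages, test_messages, threshold=0.5):
    """Flag drift when train/test similarity falls below the threshold.

    Assumption: 0.5 is an illustrative value, not one taken from the paper.
    Returns a (drifted, similarity) pair.
    """
    sim = cosine_similarity(
        feature_vector(train_messages), feature_vector(test_messages)
    )
    return sim < threshold, sim


train = ["cheap pills buy now", "buy cheap watches"]
new_batch = ["crypto investment opportunity"]
drifted, sim = drift_detected(train, new_batch)
print(drifted, sim)
```

In a setup like this, a drift flag would be the signal to update the classification model, which corresponds to the "when" question of the knowledge transfer step. The threshold would have to be inferred from data, as the abstract reports doing experimentally.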
