High Efficiency Spam Filtering: A Manifold Learning-Based Approach

Spam filtering, the detection of unsolicited, unwanted, and virus-infested emails, is a significant problem because spam wastes Internet resources and people’s time and can even cause loss of property. The support vector machine (SVM) is the state-of-the-art method for high-accuracy spam filtering. However, SVM incurs high time complexity because of the high dimensionality of email data. In this study, we propose a manifold learning-based approach for time-efficient spam filtering. Our experiments show that most features are not decisive, and that only a small fraction of spam emails can be detected using the nondecisive features. Based on this insight, we employ the Laplace feature map algorithm to capture the geometrical information of email text datasets and extract the decisive features. The extracted features are then used as the input to an SVM for spam filtering. We conduct extensive experiments on three datasets, and the evaluation results indicate that the proposed algorithm achieves both high accuracy and time efficiency.
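The dimensionality-reduction step can be illustrated with a minimal sketch. It assumes that the "Laplace feature map" corresponds to the standard Laplacian eigenmaps construction; the abstract does not give the authors' exact formulation. The function name `laplacian_eigenmaps`, the parameters `n_components` and `n_neighbors`, the binary k-nearest-neighbor weights, and the use of the symmetric normalized Laplacian are all illustrative choices, not details taken from the paper. The sketch also assumes the neighbor graph is connected.

```python
import numpy as np


def laplacian_eigenmaps(X, n_components=2, n_neighbors=5):
    """Embed the rows of X into n_components dimensions.

    Hypothetical helper: a textbook Laplacian eigenmaps sketch,
    not the authors' implementation.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise squared Euclidean distances between feature vectors.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor

    # Binary k-nearest-neighbor adjacency, symmetrized (assumed weighting).
    idx = np.argsort(d2, axis=1)[:, :n_neighbors]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; skip the trivial
    # eigenvector (eigenvalue 0) and rescale to solve L y = lambda D y.
    _, vecs = np.linalg.eigh(L)
    return vecs[:, 1:n_components + 1] * d_inv_sqrt[:, None]


# Toy usage: 20 random "emails" with 50 features each, reduced to 3.
rng = np.random.default_rng(0)
features = rng.random((20, 50))
embedded = laplacian_eigenmaps(features, n_components=3, n_neighbors=5)
print(embedded.shape)
```

The low-dimensional rows of `embedded` would then serve as SVM inputs in place of the original high-dimensional features. Note that this construction embeds only the points used to build the graph; how the paper maps unseen emails into the reduced space is not stated in the abstract.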
