SMS Spam Detection Through Skip-gram Embeddings and Shallow Networks

The drastic decrease in mobile SMS costs has left phone users more exposed to spam messages, which usually carry unwanted marketing or questionable content. In response, researchers have proposed a range of methods for detecting SMS spam. This paper presents a technique for embedding SMS messages into a vector space suited to spam detection. The approach mines patterns that are relevant for distinguishing spam from legitimate messages, and uses a subset of those patterns to construct a function that maps text messages into a multidimensional vector space. The extracted patterns are represented as skip-grams of token attributes; a skip-gram generalizes the n-gram by allowing a distance greater than one between matched tokens in the text. We evaluate the approach by using the generated vectors for spam classification on the UCI Spam Collection dataset. The experiments show that our method, combined with shallow networks, reaches accuracy competitive with state-of-the-art approaches.
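
To make the skip-gram idea concrete, the following is a minimal sketch rather than the method used in the paper. It assumes a hypothetical attribute scheme (`NUM`, `UPPER`, `WORD`, `OTHER`), whitespace tokenization, a hand-picked pattern list, and the common k-skip-n-gram definition in which at most `k` tokens in total may be skipped inside one match. None of these choices come from the abstract, which does not describe the pattern-mining step or the actual token attributes.

```python
from itertools import combinations


def token_attribute(token):
    """Map a token to a coarse attribute (hypothetical scheme, for illustration only)."""
    if token.isdigit():
        return "NUM"
    if token.isupper():
        return "UPPER"
    if token.isalpha():
        return "WORD"
    return "OTHER"


def skip_grams(items, n=2, k=1):
    """Return all n-item skip-grams that skip at most k items in total.

    With k=0 this reduces to ordinary n-grams.
    """
    grams = []
    for i in range(len(items)):
        # The remaining n-1 items must lie within the next n-1+k positions.
        window = range(i + 1, min(i + n + k, len(items)))
        for rest in combinations(window, n - 1):
            grams.append((items[i],) + tuple(items[j] for j in rest))
    return grams


def embed(message, patterns, n=2, k=1):
    """Map a message to a binary vector: 1 where a pattern occurs as a skip-gram."""
    attrs = [token_attribute(t) for t in message.split()]
    found = set(skip_grams(attrs, n, k))
    return [1 if p in found else 0 for p in patterns]


# Assumed example patterns; in the paper these would come from pattern mining.
patterns = [("UPPER", "NUM"), ("UPPER", "WORD"), ("WORD", "NUM")]
vector = embed("WIN 1000 now", patterns)
```

In this sketch the pattern `("UPPER", "WORD")` matches only because one token may be skipped between `WIN` and `now`; with `k=0` the same message would match the first pattern alone. The resulting vectors could then be fed to any classifier, such as a shallow network.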
