A comparative study of word embedding techniques for SMS spam detection

E-mail and SMS are among the most widely used communication tools for businesses, organizations and educational institutions. Every day, people receive hundreds of messages, each of which is either spam or ham (a legitimate message). Spam is any form of unsolicited, unwanted digital communication, usually sent out in bulk. Spam e-mails and SMS messages waste resources by flooding network lines and consuming storage space. It is therefore important to develop high-accuracy spam detection models that block spam messages effectively, so as to optimize resources and protect users. Various word embedding techniques such as Bag of Words (BOW), N-grams, TF-IDF, Word2Vec and Doc2Vec have been widely applied to NLP problems; however, a comparative study of these techniques for SMS spam detection is currently lacking. In this paper, we therefore provide a comparative analysis of these popular word embedding techniques for SMS spam detection by evaluating their performance on a publicly available ham and spam dataset. We investigate the performance of the word embedding techniques with five machine learning classifiers: Multinomial Naive Bayes (MNB), k-nearest neighbors (KNN), support vector machine (SVM), Random Forest and Extra Trees. On the dataset employed in the study, N-gram, BOW and TF-IDF with oversampling recorded the highest F1 scores: 0.99 for ham and 0.94 for spam.
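
To make the terms in the abstract concrete, the following is a minimal, standard-library-only sketch of three of the count-based representations (BOW, N-gram, TF-IDF), random oversampling, and the per-class F1 score. It is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF variant, the function names and the toy messages are all assumptions made for illustration, and no classifier is included.

```python
import math
import random
from collections import Counter


def ngrams(text, n_max=1):
    """Lowercased whitespace tokens plus contiguous n-grams up to length n_max."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats


def bow(docs, n_max=1):
    """Bag of Words (n_max=1) or N-gram (n_max>1) count vectors, one Counter per document."""
    return [Counter(ngrams(d, n_max)) for d in docs]


def tfidf(docs, n_max=1):
    """TF-IDF weights with a smoothed idf = ln((1 + N) / (1 + df)) + 1.

    This is one common variant, assumed here; the paper does not state which it uses.
    """
    counts = bow(docs, n_max)
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency per term
    return [
        {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, tf in c.items()}
        for c in counts
    ]


def oversample(docs, labels, seed=0):
    """Random oversampling: duplicate minority-class items until all classes are the same size."""
    rng = random.Random(seed)
    by_label = {}
    for d, y in zip(docs, labels):
        by_label.setdefault(y, []).append(d)
    target = max(len(items) for items in by_label.values())
    out_docs, out_labels = [], []
    for y, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_docs += items + extra
        out_labels += [y] * target
    return out_docs, out_labels


def f1_score(y_true, y_pred, positive):
    """Per-class F1: harmonic mean of precision and recall for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical toy messages, not taken from the dataset used in the paper.
    docs = ["see you at lunch", "call me later", "are you home", "free prize call now"]
    labels = ["ham", "ham", "ham", "spam"]
    balanced_docs, balanced_labels = oversample(docs, labels)
    print(Counter(balanced_labels))
    print(bow(balanced_docs, n_max=2)[0])
    print(tfidf(balanced_docs)[0])
```

In a real experiment, oversampling would normally be applied to the training split only, so that duplicated messages do not leak into the test set. Reporting F1 separately for ham and spam, as the abstract does, matters because the spam class is usually much smaller than the ham class.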
