A weighted feature enhanced Hidden Markov Model for spam SMS filtering

Abstract: Short message service (SMS) is one of the most favored communication services in daily life, but it is being misused by spammers. Rule-based systems (RBS) and content-based filtering (CBF) techniques have been developed to filter out spam messages. New rules can easily be added to an RBS, but throughput usually drops as the number of rules grows. CBF techniques built on the bag-of-words (BoW) assumption use machine learning methods to extract features from the SMS message body according to word frequency and distribution, and they ignore word order. Striving to improve performance, researchers have developed hybrid models that make the algorithms ever more complex. In addition, the need to frequently retrain and redeploy these time-consuming models forces the anti-spam industry to still rely mainly on rule-based systems, whose throughput issue remains unsolved. A discrete hidden Markov model (HMM) was proposed in our previous study to address these issues, and that HMM method achieved performance comparable to deep learning methods. To further improve the performance of the HMM method, we propose a new approach that weights and labels the words in an SMS message to form the observation sequence for the HMM. The weighted feature enhanced HMM achieves higher accuracy and much faster training and filtering speeds, meeting the requirements of the anti-spam industry. The performance comparison with other machine learning methods is conducted on the same open repository data set maintained by the University of California, Irvine (UCI). Experimental results show that the weighted feature enhanced HMM outperforms the long short-term memory (LSTM) model and is close to the convolutional neural network (CNN) in terms of classification accuracy. In addition, a Chinese SMS data set is used to further validate filtering accuracy and filtering speed.
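The idea of weighting and labeling words to form an HMM observation sequence can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, since the abstract does not give the actual scheme: the smoothed log-ratio weight, the three-label discretization, the `threshold` value, and the function names are all hypothetical. The forward algorithm itself is the standard one for discrete HMMs.

```python
import math
from collections import Counter


def word_weights(spam_msgs, ham_msgs):
    """Hypothetical weight: smoothed log-ratio of a word's spam vs. ham frequency."""
    spam = Counter(w for m in spam_msgs for w in m.lower().split())
    ham = Counter(w for m in ham_msgs for w in m.lower().split())
    vocab = set(spam) | set(ham)
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    return {
        w: math.log(((spam[w] + 1) / (n_spam + len(vocab)))
                    / ((ham[w] + 1) / (n_ham + len(vocab))))
        for w in vocab
    }


def to_observations(message, weights, threshold=0.5):
    """Label each word: 0 = ham-leaning, 1 = neutral or unknown, 2 = spam-leaning.

    Word order is preserved, so the result is a sequence, not a bag of words.
    """
    labels = []
    for w in message.lower().split():
        score = weights.get(w, 0.0)  # unseen words get a neutral weight (assumption)
        labels.append(2 if score > threshold else 0 if score < -threshold else 1)
    return labels


def forward_likelihood(obs, start, trans, emit):
    """Standard forward algorithm: P(obs | model) for a discrete HMM."""
    if not obs:
        return 1.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


# Toy data and toy model parameters, chosen by hand for illustration only.
weights = word_weights(
    ["win free prize now", "free cash win"],
    ["see you at lunch", "call me at home"],
)
obs = to_observations("Free lunch hello", weights)
start = [0.5, 0.5]                                # hidden states: 0 = ham, 1 = spam
trans = [[0.9, 0.1], [0.1, 0.9]]                  # state transition probabilities
emit = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]         # P(label | state)
likelihood = forward_likelihood(obs, start, trans, emit)
```

In practice the model parameters would be estimated from labeled training messages rather than set by hand, and a decoding step such as Viterbi would assign the final spam or ham decision. The sketch only shows how a message becomes a short sequence of discrete labels that a small HMM can score cheaply.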
