Resolving the imbalance issue in short messaging service spam dataset using cost-sensitive techniques

Abstract: Mobile spam messages have become one of the main concerns in the field of short messaging service (SMS) because of their negative impact on mobile users and networks. The current literature lacks effective solutions for this issue. In this study, the negative impacts of SMS spam were analysed, and the existing techniques for SMS spam detection were investigated through two experiments. The first experiment tested and compared current data mining and cost-sensitive techniques, whereas the second experiment tested the performance of the proposed technique. Based on the results of the first experiment, the best-performing non-cost-sensitive classifier is a Bayesian network classifier, which also behaves well when used under a cost-sensitive classifier, obtaining the lowest false negative rate and an acceptable false positive rate. The proposed strategy achieves the best performance in terms of false negatives in SMS spam classification, obtaining the lowest total cost and the highest precision amongst the compared strategies.
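
As an illustration of the cost-sensitive idea the abstract refers to, the following is a minimal sketch, not the paper's method. It assumes spam is the positive class, so a false negative is a spam message delivered as legitimate. The cost values and function names are hypothetical and chosen only for this example. The first function sums the misclassification cost from error counts; the second picks the label with the lower expected cost given an estimated spam probability.

```python
def total_cost(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Total misclassification cost from error counts.

    cost_fp and cost_fn are hypothetical per-error costs; correct
    predictions are assumed to cost nothing.
    """
    return fp * cost_fp + fn * cost_fn


def min_expected_cost_label(p_spam, cost_fp=1.0, cost_fn=5.0):
    """Choose the label with the lower expected cost.

    Predicting "ham" risks a false negative with probability p_spam;
    predicting "spam" risks a false positive with probability 1 - p_spam.
    """
    expected_cost_if_ham = p_spam * cost_fn
    expected_cost_if_spam = (1.0 - p_spam) * cost_fp
    return "spam" if expected_cost_if_ham > expected_cost_if_spam else "ham"


if __name__ == "__main__":
    # 3 legitimate messages flagged as spam, 2 spam messages missed.
    print(total_cost(fp=3, fn=2))
    # With these costs, a message is flagged once p_spam exceeds 1/6.
    print(min_expected_cost_label(0.2))
    print(min_expected_cost_label(0.1))
```

With unequal costs the decision threshold moves away from 0.5: under the assumed costs above, it falls to cost_fp / (cost_fp + cost_fn) = 1/6, so fewer spam messages are missed at the price of more false positives.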
