Towards SMS Spam Filtering: Results under a New Dataset

The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. Recent reports clearly indicate that the volume of mobile phone spam is growing year by year. In practice, fighting this plague is made difficult by several factors, including the lower rate of SMS, which has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. In academic settings, probably one of the major concerns is the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. Moreover, traditional content-based filters may have their performance seriously degraded, since SMS messages are fairly short and their text is generally rife with idioms and abbreviations. In this paper, we present details about a new real, public and non-encoded SMS spam collection that is, as far as we know, the largest one available. We also offer a comprehensive analysis of this dataset to ensure that there are no duplicated messages coming from previously existing datasets, since such duplicates may ease the task of learning SMS spam classifiers and could compromise the evaluation of methods. Additionally, we compare the performance achieved by several established machine learning techniques. In summary, the results indicate that the procedure followed to build the collection does not lead to near-duplicates and that, among the classifiers, the Support Vector Machine outperforms the other evaluated techniques and can therefore be used as a good baseline for further comparison.
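
The abstract does not say how duplicated messages were detected. As an illustration only, the following is a minimal sketch of one common approach, assuming word-level shingles compared with Jaccard resemblance. The function names, the shingle size `n=3`, and the `threshold=0.8` cutoff are hypothetical choices for this example, not values taken from the paper.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) of a lowercased message."""
    words = text.lower().split()
    if len(words) < n:
        # Messages shorter than n words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def resemblance(a, b, n=3):
    """Jaccard resemblance between the shingle sets of two messages."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two empty messages are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicates(messages, threshold=0.8, n=3):
    """Return index pairs (i, j), i < j, whose resemblance reaches the threshold.

    The threshold is an assumed value; the pairwise loop is quadratic and is
    only meant for small collections.
    """
    pairs = []
    for i in range(len(messages)):
        for j in range(i + 1, len(messages)):
            if resemblance(messages[i], messages[j], n) >= threshold:
                pairs.append((i, j))
    return pairs
```

Running `near_duplicates` over the union of a new collection and an older one would flag candidate messages that appear in both, which could then be inspected or removed before training and evaluating classifiers.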
