Contributions to the study of SMS spam filtering: new collection and results

The growth in the number of mobile phone users has led to a dramatic increase in SMS spam messages. In practice, fighting mobile phone spam is made difficult by several factors, including the lower rate of SMS, which has allowed many users and service providers to ignore the issue, and the limited availability of mobile phone spam-filtering software. In academic settings, on the other hand, a major handicap is the scarcity of public SMS spam datasets, which are sorely needed for validating and comparing different classifiers. Moreover, because SMS messages are fairly short, the performance of content-based spam filters may be degraded. In this paper, we offer a new real, public and non-encoded SMS spam collection that is, as far as we know, the largest one available. We also compare the performance achieved by several established machine learning methods. The results indicate that the Support Vector Machine outperforms the other evaluated classifiers and, hence, can be used as a good baseline for further comparison.
