On the Validity of a New SMS Spam Collection

Mobile phones are becoming the latest target of electronic junk mail. Recent reports clearly indicate that the volume of SMS spam messages is increasing dramatically year by year. One of the major concerns in academic settings has been the scarcity of public SMS spam datasets, which are sorely needed for the validation and comparison of different classifiers. To address this issue, we have recently proposed a new SMS Spam Collection that, to the best of our knowledge, is the largest public dataset of real SMS messages available for academic studies. However, as it was created by augmenting a previously existing database built from roughly the same sources, it is sensible to verify that no duplicates were carried over from them. In this paper, we therefore offer a comprehensive analysis of the new SMS Spam Collection to ensure that it contains no such duplicates, since their presence could make the task of learning SMS spam classifiers artificially easy and, hence, compromise the evaluation of methods. The results of the analysis indicate that the procedure followed does not lead to near-duplicates and, consequently, that the proposed dataset is reliable for evaluating and comparing the performance achieved by different classifiers.
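To make the notion of a near-duplicate concrete, the following is a minimal sketch of one common way to flag near-duplicate messages: Jaccard similarity over word n-grams (shingles). It is an illustrative assumption, not the procedure used in this paper; the function names, the shingle size `n=3`, and the `threshold=0.5` cutoff are all hypothetical choices.

```python
import re
from itertools import combinations


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased, punctuation-free message."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Messages shorter than n words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined here as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def near_duplicates(messages, n=3, threshold=0.5):
    """Return (i, j, similarity) for every message pair at or above the threshold."""
    sets = [shingles(m, n) for m in messages]
    pairs = []
    for i, j in combinations(range(len(messages)), 2):
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            pairs.append((i, j, sim))
    return pairs


if __name__ == "__main__":
    # Made-up sample messages, not taken from the SMS Spam Collection.
    sample = [
        "Free entry in a weekly competition to win tickets",
        "FREE entry in a weekly competition to win tickets!!",
        "Are we still meeting for lunch today?",
    ]
    for i, j, sim in near_duplicates(sample):
        print(i, j, round(sim, 2))
```

This pairwise comparison is quadratic in the number of messages, which is acceptable for a collection of a few thousand SMS messages; a larger corpus would likely call for an indexing or hashing scheme instead.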
