The curse of 140 characters: evaluating the efficacy of SMS spam detection on Android

Many applications for SMS spam filtering are available on the Android marketplace. In this paper, we reverse engineer these applications and conduct a detailed study of the spam filtering methods they use. Our study has three parts. First, we perform empirical tests to evaluate the accuracy and precision of these apps. Second, we test whether email spam classifiers can be used effectively on short text messages. The empirical results show that these classifiers do not achieve the accuracy on SMS data that they achieve on email. Finally, we develop a two-level stacked classifier for short text messages and demonstrate its improvement in accuracy over traditional Bayesian email spam filters. Our experimental results show that the stacked classifier achieves spam filtering precision and accuracy of nearly 98%, comparable to those of email classifiers.
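To make the idea of a two-level stacked classifier concrete, the following is a minimal sketch and not the classifier developed in the paper. The abstract does not name the base learners, the features, or the meta-learner, so everything here is an assumption chosen for illustration: two naive Bayes base models (one on words, one on character trigrams), a logistic-regression meta-learner trained on a held-out split, and a hypothetical toy data set.

```python
import math
from collections import Counter

def word_tokens(text):
    return text.lower().split()

def char_trigrams(text):
    t = text.lower()
    return [t[i:i + 3] for i in range(len(t) - 2)]

class NaiveBayes:
    """Level 0: multinomial naive Bayes with Laplace smoothing."""
    def __init__(self, featurize):
        self.featurize = featurize

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}  # 0 = ham, 1 = spam
        self.docs = Counter(labels)
        for text, y in zip(texts, labels):
            self.counts[y].update(self.featurize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def log_odds(self, text):
        """Log of P(spam | text) / P(ham | text); unseen features are skipped."""
        v = len(self.vocab)
        score = math.log(self.docs[1] / self.docs[0])
        for f in self.featurize(text):
            if f in self.vocab:
                p_spam = (self.counts[1][f] + 1) / (self.totals[1] + v)
                p_ham = (self.counts[0][f] + 1) / (self.totals[0] + v)
                score += math.log(p_spam / p_ham)
        return score

def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

class StackedClassifier:
    """Level 1: logistic regression over the base models' log-odds."""
    def __init__(self):
        self.base = [NaiveBayes(word_tokens), NaiveBayes(char_trigrams)]
        self.weights = [0.0] * (len(self.base) + 1)  # last entry is the bias

    def _meta_features(self, text):
        return [m.log_odds(text) for m in self.base] + [1.0]

    def fit(self, base_data, meta_data, epochs=200, lr=0.1):
        texts, labels = zip(*base_data)
        for m in self.base:
            m.fit(texts, labels)
        # The meta-learner sees only messages the base models were not trained on.
        rows = [(self._meta_features(t), y) for t, y in meta_data]
        for _ in range(epochs):
            grad = [0.0] * len(self.weights)
            for x, y in rows:
                err = y - sigmoid(sum(w * xi for w, xi in zip(self.weights, x)))
                for j, xi in enumerate(x):
                    grad[j] += err * xi
            self.weights = [w + lr * g / len(rows)
                            for w, g in zip(self.weights, grad)]
        return self

    def spam_probability(self, text):
        x = self._meta_features(text)
        return sigmoid(sum(w * xi for w, xi in zip(self.weights, x)))

    def predict(self, text):
        return int(self.spam_probability(text) >= 0.5)

# Hypothetical toy messages (1 = spam, 0 = ham); not data from the paper.
base_data = [
    ("win a free prize now", 1), ("free cash claim now", 1),
    ("claim your free prize", 1), ("see you at lunch", 0),
    ("are you home now", 0), ("call me after lunch", 0),
]
meta_data = [
    ("free prize claim", 1), ("win cash now", 1),
    ("lunch at home", 0), ("call me", 0),
]
clf = StackedClassifier().fit(base_data, meta_data)
```

Calling `clf.predict("claim free cash prize")` or `clf.spam_probability(...)` then classifies a new message. Training the meta-learner on a split the base models never saw keeps it from simply trusting overfit base predictions; with realistic data, cross-validated out-of-fold predictions would be the more common choice.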
