SMS spam filtering: Methods and data

Highlights

- We motivate the need for content-based SMS spam filtering.
- We discuss similarities/differences between email and SMS spam filtering.
- We review recent research in SMS spam filtering.
- We analyse recent SMS spam messages and make a dataset available.
- Early days, no consensus yet on best techniques but significant challenges exist.

Mobile or SMS spam is a real and growing problem primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.
