SMS spam filtering and thread identification using bi-level text classification and clustering techniques

SMS spam detection is the task of identifying and filtering spam SMS messages. As the number of SMS messages exchanged every day grows, it becomes difficult for a user to remember previously received messages and to relate newly received messages to them. SMS threads provide a solution to this problem. This work discusses the problems of SMS spam detection and thread identification and presents a state-of-the-art clustering-based algorithm. The work proceeds in two stages. In the first stage, a binary classification technique categorizes SMS messages as spam or non-spam. In the second stage, clusters are created from the non-spam messages using non-negative matrix factorization and K-means clustering. A threading-based similarity feature, namely the time between consecutive communications, is described for identifying SMS threads, and the impact of the time threshold on thread identification is analysed experimentally. Performance is evaluated in terms of accuracy, precision, recall and F-measure. The SMS threads identified in this work can be used in applications such as SMS thread summarization, SMS folder classification and other SMS management tasks.
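The role of the time threshold can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the messages already belong to one cluster of non-spam SMS, and that a new thread starts whenever the gap between consecutive messages exceeds the threshold. The function name `split_into_threads`, the sample messages and the 30-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta


def split_into_threads(messages, threshold):
    """Group (timestamp, text) pairs into threads by inter-message time gap.

    Assumption (not taken from the paper): a message joins the current
    thread if it arrives within `threshold` of the previous message;
    otherwise it starts a new thread.
    """
    threads = []
    # Sort by timestamp only, so ties never fall back to comparing texts.
    for timestamp, text in sorted(messages, key=lambda m: m[0]):
        if threads and timestamp - threads[-1][-1][0] <= threshold:
            threads[-1].append((timestamp, text))
        else:
            threads.append([(timestamp, text)])
    return threads


# Hypothetical messages, assumed to come from a single non-spam cluster.
cluster = [
    (datetime(2024, 1, 1, 9, 0), "Are we meeting today?"),
    (datetime(2024, 1, 1, 9, 5), "Yes, at noon."),
    (datetime(2024, 1, 1, 18, 30), "Dinner tonight?"),
    (datetime(2024, 1, 1, 18, 40), "Sure, 8 pm."),
]

threads = split_into_threads(cluster, timedelta(minutes=30))
for number, thread in enumerate(threads, start=1):
    print(number, [text for _, text in thread])
```

With a 30-minute threshold the four sample messages fall into two threads, while a much larger threshold (for example 12 hours) merges them into one. This sensitivity is why the choice of threshold matters for thread identification.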
