Semi-supervised learning using frequent itemset and ensemble learning for SMS classification

We use semi-supervised learning based on frequent itemsets and ensemble learning to classify SMS data into ham and spam. We evaluate the approach on the publicly available UCI SMS Spam Collection and on the SMS Spam Collection Corpus v.0.1 Small and Big data sets. We compare our results with the existing semi-supervised learning methods PEBL and Spy-EM. We obtain good results with a very small amount of positive data and varying amounts of unlabeled data.

Short Message Service (SMS) has become one of the most important communication media owing to the rapid increase in mobile users and its easy-to-use operating mechanism. This flood of SMS brings with it the problem of spam SMS generated by spurious users. The detection of spam SMS has attracted growing attention from researchers in recent times and has been addressed with a number of different machine learning approaches. The supervised machine learning approaches used so far demand a large amount of labeled data, which is not always available in real applications. Traditional semi-supervised methods can alleviate this problem but may not produce good results when they are provided with only positive and unlabeled data. In this paper, we propose a novel semi-supervised learning method that makes use of frequent itemsets and ensemble learning (FIEL) to overcome this limitation. In this approach, the Apriori algorithm is used to find the frequent itemsets, while Multinomial Naive Bayes, Random Forest and LibSVM are used as base learners for ensemble learning with a majority voting scheme. Our proposed approach works well, with higher accuracy, given a small number of positive examples and different amounts of unlabeled data. Extensive experiments conducted on the UCI SMS Spam Collection data set and the SMS Spam Collection Corpus v.0.1 Small and Big show significant improvements in accuracy with a very small amount of positive data. We have compared our proposed FIEL approach with the existing Spy-EM and PEBL approaches, and the results show that our approach is more stable than the compared approaches with minimum support.
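The two building blocks named in the abstract, Apriori frequent itemset mining and majority voting, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy messages, the function names `frequent_itemsets` and `majority_vote`, and the support threshold of 0.75 are assumptions made for illustration only. The base learners are not trained here, so the votes passed to `majority_vote` are placeholder labels standing in for the outputs of Multinomial Naive Bayes, Random Forest and LibSVM.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Apriori: map each itemset with support >= min_support to its support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    # Level 1: frequent single items.
    counts = Counter(item for t in transactions for item in t)
    current = {
        frozenset([item]): count / n
        for item, count in counts.items()
        if count / n >= min_support
    }
    result = dict(current)

    k = 2
    while current:
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count support of the surviving candidates.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                current[c] = support
        result.update(current)
        k += 1
    return result


def majority_vote(predictions):
    """Return the label predicted by the largest number of base learners."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical toy spam messages, tokenized into word sets (not from the paper's data).
messages = [
    "free prize call now",
    "call now free entry",
    "free prize claim now",
    "win free prize now",
]
transactions = [set(m.split()) for m in messages]
itemsets = frequent_itemsets(transactions, min_support=0.75)

# Placeholder votes standing in for three trained base learners.
label = majority_vote(["spam", "ham", "spam"])
```

With three base learners and two classes, a majority always exists, so no tie-breaking rule is needed in this setting.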
