Using J48 Tree Partitioning for Scalable SVM in Spam Detection

The Support Vector Machine (SVM) is a state-of-the-art, powerful machine-learning algorithm with strong regularization properties; regularization refers to how well a model generalizes to new data. SVM can therefore be very effective for spam detection. Although experimental results show that SVM usually outperforms other algorithms, its efficiency degrades as the number of spam features grows. In this paper, a scalable SVM based on the J48 decision tree is proposed for spam detection. In the proposed method, the dataset is first partitioned using a J48 tree; feature selection is then applied to each partition in parallel, and the selected features are used in the SVM training phase. The proposed method is evaluated on several benchmark datasets, and the results are compared with other algorithms such as plain SVM and GA-SVM. The experimental results show that the proposed method remains scalable as the number of features increases and achieves higher accuracy than SVM and GA-SVM.
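The abstract describes a three-step pipeline: partition the data with a J48 tree, select features within each partition, and train SVMs on the reduced partitions. The Python sketch below is one plausible reading of that pipeline, not the paper's implementation: scikit-learn's DecisionTreeClassifier (entropy criterion) stands in for Weka's J48, ANOVA-based SelectKBest stands in for the per-partition feature selector, one linear SVC is trained per partition, and prediction routes each sample to its leaf's model. All of these component choices are assumptions.

# Minimal sketch of the partition -> feature-selection -> SVM pipeline, assuming:
#   - DecisionTreeClassifier with entropy splits approximates Weka's J48,
#   - tree leaf membership defines the partitions,
#   - SelectKBest (ANOVA F-score) is the per-partition feature selector,
#   - one linear SVM per partition, routed by leaf at prediction time.
# These are illustrative stand-ins, not the paper's exact components.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a high-dimensional spam dataset.
X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=30, random_state=0)

# 1) Partition the data with a shallow decision tree.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              random_state=0).fit(X, y)
partitions = tree.apply(X)  # leaf index for every sample

models = {}
for leaf in np.unique(partitions):
    idx = partitions == leaf
    X_p, y_p = X[idx], y[idx]
    if len(np.unique(y_p)) < 2:
        # A pure leaf is already decided by the tree itself.
        models[leaf] = ("pure", int(y_p[0]))
        continue
    # 2) Feature selection inside the partition (could run in parallel across leaves).
    selector = SelectKBest(f_classif, k=min(20, X_p.shape[1])).fit(X_p, y_p)
    # 3) Train an SVM on the reduced feature set of this partition.
    svm = SVC(kernel="linear").fit(selector.transform(X_p), y_p)
    models[leaf] = ("svm", selector, svm)

def predict(x):
    """Route a sample to the model of the leaf it falls into."""
    leaf = tree.apply(x.reshape(1, -1))[0]
    kind, *parts = models[leaf]
    if kind == "pure":
        return parts[0]
    selector, svm = parts
    return int(svm.predict(selector.transform(x.reshape(1, -1)))[0])

print(predict(X[0]), y[0])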
