A novel committee selection mechanism for combining classifiers to detect unsolicited emails

Purpose Email is an important medium for sharing information rapidly. However, spam is a nuisance in such communication, which motivates the building of a robust filtering system with high classification accuracy and a low false positive rate. In that context, this paper presents a combined classifier technique using a committee selection mechanism, where the main objective is to identify a set of classifiers whose individual decisions can be combined by a committee selection procedure for accurate detection of spam.

Design/methodology/approach Text mining approaches are used to train and test the relevant machine learning classifiers. Three data sets (Enron, SpamAssassin and LingSpam) have been used to test the classifiers. Initially, pre-processing is performed to extract the features associated with the email files. In the next step, the extracted features are passed through a dimensionality reduction method that removes non-informative features. Subsequently, an informative feature subset is selected using genetic feature search. Thereafter, the proposed classifiers are tested on those informative features and the results are compared with those of other classifiers.

Findings Three different studies have been performed to build the proposed combined classifier. The first study identifies the effect of boosting algorithms on two probabilistic classifiers: Bayesian and Naive Bayes. In that study, AdaBoost was found to be the best algorithm for performance boosting. The second study examined the effect of different kernel functions on the support vector machine (SVM) classifier, where SVM with the normalized polynomial (NP) kernel was observed to be the best. The last study combined classifiers with committee selection, where the committee members were the best classifiers identified by the first study, i.e. Bayesian and Naive Bayes with AdaBoost, and the committee president was selected from the second study, i.e. SVM with the NP kernel. Results show that combining the identified classifiers to form a committee machine gives excellent accuracy with a low false positive rate.

Research limitations/implications This research focuses on the classification of email spam written in English. Only the body (text) parts of the emails have been used. Image spam has not been included in this work. The work is restricted to email messages; other message types, such as short message service or multimedia messaging service, were not part of this study.

Practical implications This research proposes a method of dealing with the issues and challenges faced by internet service providers and organizations that use email. The proposed model provides not only better classification accuracy but also a low false positive rate.

Originality/value The proposed combined classifier is a novel classifier designed for accurate classification of email spam.
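
The abstract does not spell out how the committee combines its votes, so the following is only a minimal sketch under an assumed rule: when the two boosted members agree, their label is used, and when they disagree, the president (the SVM with NP kernel) decides. The function names, the toy prediction lists and the tie-breaking rule are all hypothetical illustrations, not the authors' implementation.

```python
from typing import Sequence

SPAM, HAM = 1, 0  # label encoding assumed for this sketch


def committee_decision(member_votes: Sequence[int], president_vote: int) -> int:
    """Assumed rule: unanimous members win; otherwise the president decides."""
    if not member_votes:
        return president_vote
    first = member_votes[0]
    if all(vote == first for vote in member_votes):
        return first
    return president_vote


def false_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of legitimate (ham) emails that were flagged as spam."""
    ham_total = sum(1 for t in y_true if t == HAM)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == HAM and p == SPAM)
    return false_pos / ham_total if ham_total else 0.0


# Toy per-email predictions (made up for illustration only).
boosted_bayesian = [1, 0, 1, 0]      # member 1: Bayesian + AdaBoost
boosted_naive_bayes = [1, 1, 0, 0]   # member 2: Naive Bayes + AdaBoost
svm_np_kernel = [1, 0, 0, 0]         # president: SVM with NP kernel
y_true = [1, 0, 1, 0]

combined = [
    committee_decision([a, b], p)
    for a, b, p in zip(boosted_bayesian, boosted_naive_bayes, svm_np_kernel)
]
```

In this toy run the members disagree on the second and third emails, so the president's label is used there. Other combination rules (for example, weighted voting) would fit the same interface.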
