Towards Designing an Email Classification System Using Multi-view Based Semi-supervised Learning

The goal of email classification is to separate user emails into spam and legitimate messages. Many supervised learning algorithms have been developed for this task, and these algorithms require a large amount of labeled training data. However, data labeling is labor-intensive and requires in-depth domain knowledge, so in practice only a very small proportion of the data can be labeled. This bottleneck greatly degrades the effectiveness of supervised email classification systems. To address this problem, we first identify several critical issues in supervised machine learning-based email classification. We then propose a classification model based on multi-view, disagreement-based semi-supervised learning. The motivation for combining the two is that multiple views provide richer information for classification, which is often ignored in the literature, while semi-supervised learning makes it possible to learn from both labeled and unlabeled data. In the evaluation, we demonstrate that multi-view data improves email classification compared with single-view data, and that the proposed model working with our algorithm achieves better performance than existing similar algorithms.
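To make the general idea concrete, the following is a minimal sketch of generic two-view co-training, which is one common form of disagreement-based semi-supervised learning. It is not the algorithm proposed in this work. The nearest-centroid classifier, the function names (`centroids`, `predict`, `co_train`), the `min_margin` threshold, and the toy feature vectors are all assumptions chosen for illustration.

```python
from math import dist

def centroids(X, y):
    """Mean feature vector per class label."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def predict(cents, x):
    """Return (label, margin): nearest centroid and its gap to the runner-up."""
    ranked = sorted((dist(x, c), label) for label, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else 0.0
    return ranked[0][1], margin

def co_train(view_a, view_b, labels, unl_a, unl_b, rounds=5, min_margin=1.0):
    """Two-view co-training: each view's classifier pseudo-labels data for the other."""
    train_a, y_a = list(view_a), list(labels)
    train_b, y_b = list(view_b), list(labels)
    pool = list(range(len(unl_a)))
    for _ in range(rounds):
        cents_a, cents_b = centroids(train_a, y_a), centroids(train_b, y_b)
        used = set()
        for i in pool:
            la, ma = predict(cents_a, unl_a[i])
            lb, mb = predict(cents_b, unl_b[i])
            if ma >= min_margin and ma >= mb:
                # View A is the more confident one: it teaches view B.
                train_b.append(unl_b[i]); y_b.append(la); used.add(i)
            elif mb >= min_margin:
                # View B is confident: it teaches view A.
                train_a.append(unl_a[i]); y_a.append(lb); used.add(i)
        if not used:
            break  # no confident pseudo-labels left to exchange
        pool = [i for i in pool if i not in used]
    return centroids(train_a, y_a), centroids(train_b, y_b)

# Hypothetical toy data: two labeled emails and three unlabeled ones,
# each described by two made-up numeric views (e.g. body and header features).
view_a = [[9.0, 8.0], [1.0, 1.0]]
view_b = [[8.0, 9.0], [0.0, 1.0]]
labels = ["spam", "ham"]
unl_a = [[8.0, 9.0], [5.0, 5.0], [0.0, 1.0]]
unl_b = [[5.0, 5.0], [9.0, 9.0], [1.0, 0.0]]

cents_a, cents_b = co_train(view_a, view_b, labels, unl_a, unl_b)
print(predict(cents_a, [7.0, 7.0])[0])
```

In this toy run, the second unlabeled email is ambiguous in view A but clear in view B, so view B supplies its label to view A's training set. This is the sense in which a second view can contribute information that a single view lacks.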
