Business email classification using incremental subspace learning

We consider a new text classification task: classifying enterprise email messages into sensitive business topics. Identifying sensitive topics in email messages helps enterprises safeguard critical data such as intellectual property and trade secrets. We apply incremental PCA (Principal Component Analysis) to email representation: it learns a feature subspace incrementally and thereby reduces the feature dimensionality effectively. A linear SVM (Support Vector Machine) is then used to learn the classification models. We validate our approach on 5,000 emails extracted from the Enron email dataset. Experimental results show that SVM outperforms other classification methods, and that incremental PCA yields a substantial reduction in processing time and a slight increase in classification accuracy compared with an SVM trained on all the features.
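The incremental subspace step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the class name `IncrementalSubspace`, its methods, and the synthetic data are hypothetical, and the update rule is one common way to do incremental PCA (merging the current basis with each new mean-centered batch via an SVD), assumed here for illustration. Random vectors stand in for email feature vectors.

```python
import numpy as np


class IncrementalSubspace:
    """Hypothetical sketch of incremental PCA via batch-wise SVD merging."""

    def __init__(self, n_components):
        self.k = n_components
        self.n_seen = 0
        self.mean = None
        self.components = None        # rows are principal directions
        self.singular_values = None

    def partial_fit(self, batch):
        batch = np.asarray(batch, dtype=float)
        m = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        centered = batch - batch_mean
        if self.n_seen == 0:
            stacked = centered
            new_mean = batch_mean
        else:
            n = self.n_seen
            # Extra row accounting for the shift between old and new means.
            correction = np.sqrt(n * m / (n + m)) * (self.mean - batch_mean)
            stacked = np.vstack([
                self.singular_values[:, None] * self.components,
                centered,
                correction[None, :],
            ])
            new_mean = (n * self.mean + m * batch_mean) / (n + m)
        # Re-decompose the compact summary and keep the top-k directions.
        _, s, vt = np.linalg.svd(stacked, full_matrices=False)
        self.components = vt[:self.k]
        self.singular_values = s[:self.k]
        self.mean = new_mean
        self.n_seen += m
        return self

    def transform(self, X):
        # Project mean-centered samples onto the learned subspace.
        return (np.asarray(X, dtype=float) - self.mean) @ self.components.T


# Synthetic stand-in for email feature vectors (assumption: 300 emails, 20 features).
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 20))

model = IncrementalSubspace(n_components=5)
for start in range(0, len(features), 100):   # feed the data in batches of 100
    model.partial_fit(features[start:start + 100])

reduced = model.transform(features)           # shape (300, 5)
```

The reduced vectors would then be the input to a linear SVM. Because only the top-k directions are kept after each batch, the result approximates batch PCA; it matches batch PCA exactly only when no directions are discarded.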
