PCA document reconstruction for email classification

This paper presents a document classifier based on text content features and applies it to email classification. We test the validity of a classifier that uses Principal Component Analysis Document Reconstruction (PCADR). The underlying idea is that principal component analysis (PCA) compresses optimally only the kind of documents (in our experiments, email classes) used to compute the principal components (PCs); for other kinds of documents, compression with only a few components performs poorly. The classifier therefore computes a separate PCA for each document class. When a new instance arrives, it is projected onto the PCs of each class and then reconstructed from those same PCs. The reconstruction error is computed for each class, and the classifier assigns the instance to the class with the smallest error, that is, the smallest divergence from the class representation. We test this approach on email filtering by distinguishing between two message classes (e.g., spam versus ham, or phishing versus ham). The experiments show that PCADR obtains very good results on the validation datasets employed and outperforms the popular Support Vector Machine classifier.
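The classification rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`fit_class_pca`, `reconstruction_error`, `pcadr_predict`), the use of the Euclidean norm as the error measure, the choice of one component per class, and the tiny three-feature toy vectors are all assumptions made for the example.

```python
import numpy as np

def fit_class_pca(X, n_components):
    """Fit PCA on one class. Rows of X are document feature vectors.
    Returns the class mean and the top principal directions (as rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def reconstruction_error(x, mean, components):
    """Project x onto the class PCs, reconstruct it, and return the
    Euclidean distance between x and its reconstruction (assumed measure)."""
    centered = x - mean
    coords = components @ centered          # projection onto the PCs
    reconstructed = components.T @ coords   # back to feature space
    return float(np.linalg.norm(centered - reconstructed))

def pcadr_predict(x, models):
    """Assign x to the class whose PCs reconstruct it with the smallest error."""
    errors = {label: reconstruction_error(x, mean, comps)
              for label, (mean, comps) in models.items()}
    return min(errors, key=errors.get)

# Toy data (hypothetical): each class varies along a different feature axis.
ham = np.array([[1, 0, 0], [2, 0, 0], [3, 0, 0], [4, 0, 0]], dtype=float)
spam = np.array([[0, 1, 0], [0, 2, 0], [0, 3, 0], [0, 4, 0]], dtype=float)

# One PCA model per class, as in the per-class scheme described above.
models = {
    "ham": fit_class_pca(ham, n_components=1),
    "spam": fit_class_pca(spam, n_components=1),
}

print(pcadr_predict(np.array([5.0, 0.0, 0.0]), models))
print(pcadr_predict(np.array([0.0, 6.0, 0.1]), models))
```

In practice the rows of each class matrix would be real text feature vectors (for example term weights), and the number of retained components would be tuned on validation data; neither choice is specified in the abstract.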
