A MapReduce based Parallel SVM for Email Classification

Support Vector Machine (SVM) is a powerful classification and regression tool. Various approaches, including SVM-based techniques, have been proposed for email classification. Automated email classification according to messages or user-specific folders, and information extraction from chronologically ordered email streams, have become active areas of text machine learning research. This paper presents a parallel SVM based on MapReduce (PSMR) algorithm for email classification. We discuss the challenges that arise from the differences between email foldering and traditional document classification. We report experimental results for several automated classification methods and evaluation methodologies, including Naive Bayes, SVM, and PSMR, applied to timeline-based foldering of the Enron datasets. By distributing, processing, and optimizing subsets of the training data across multiple participating nodes, the PSMR algorithm significantly reduces training time.
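The abstract does not spell out the individual PSMR steps, so the following is only a minimal sketch of the general idea it describes: split the training data into subsets, train an SVM on each subset (the map step), and combine the per-subset results into one model (the reduce step). Everything in it is an assumption made for illustration: the function names (`sgd_linear_svm`, `map_partition`, `reduce_partitions`, `parallel_svm`), the simple linear SVM trained by stochastic subgradient descent, the cascade-style rule of keeping only candidate support vectors, and the synthetic two-cluster data standing in for email feature vectors. The map calls run sequentially here; on a MapReduce cluster each would run on a separate node.

```python
import random

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def sgd_linear_svm(data, lam=0.01, eta=0.01, epochs=50, seed=0):
    """Linear SVM via stochastic subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    b = 0.0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            violated = y * (dot(w, x) + b) < 1.0
            w = [(1.0 - eta * lam) * wj for wj in w]  # L2 shrinkage
            if violated:                              # hinge-loss subgradient step
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def map_partition(chunk, tol=0.1):
    """Map step (assumed): train locally, keep points on or near the margin."""
    w, b = sgd_linear_svm(chunk)
    candidates = [(x, y) for x, y in chunk if y * (dot(w, x) + b) <= 1.0 + tol]
    # Fall back to the whole chunk if one class would be lost.
    if {y for _, y in candidates} != {1, -1}:
        return list(chunk)
    return candidates

def reduce_partitions(candidate_lists):
    """Reduce step (assumed): merge candidate support vectors and retrain."""
    merged = [p for candidates in candidate_lists for p in candidates]
    return sgd_linear_svm(merged), merged

def parallel_svm(data, n_parts=4):
    size = (len(data) + n_parts - 1) // n_parts
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each map_partition call is independent and could run on its own node.
    model, _ = reduce_partitions([map_partition(c) for c in chunks])
    return model

def make_data(n, seed=42):
    """Synthetic, well-separated 2-D clusters with alternating labels."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        y = 1 if i % 2 == 0 else -1
        data.append(([rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)], y))
    return data

def accuracy(model, data):
    w, b = model
    hits = sum(1 for x, y in data if (1 if dot(w, x) + b >= 0 else -1) == y)
    return hits / len(data)

train = make_data(200)
model = parallel_svm(train, n_parts=4)
print("training accuracy:", accuracy(model, train))
```

Because the final model is trained only on the merged candidates, the reduce step handles far fewer points than the full training set when the classes are well separated. This is one plausible source of the training-time reduction the abstract reports, not a reproduction of the paper's measurements.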
