A MapReduce-Based Distributed SVM for Scalable Data Type Classification

Data type classification is a significant problem in digital forensics and information security. Among the classification approaches explored in previous work, methods based on support vector machines (SVMs) have proven the most successful. However, SVM training is computationally intensive, and its cost grows rapidly as the number of training vectors increases. In this study, we propose a parallel distributed SVM (PDSVM) based on Hadoop MapReduce for scalable data type classification. First, the map phase determines the support vectors (SVs) in each split of the dataset by running sequential minimal optimization (SMO). Then, the reduce phase merges the SVs and computes the degree of global convergence. Finally, PDSVM uses the globally converged SVs to obtain the SVM model. The experimental results demonstrate that PDSVM can not only process large-scale training datasets but also performs well in terms of classification accuracy.
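The map, reduce, and final training steps described above can be illustrated with a minimal single-process sketch. It is not the authors' Hadoop implementation. It assumes a linear kernel, a simplified SMO variant that picks the second multiplier at random, and a convergence test that stops when the merged SV set no longer changes between rounds; the abstract does not specify how the degree of global convergence is computed. All names (`smo_train`, `map_split`, `reduce_svs`, `train_pdsvm`, `predict`) and the toy data are hypothetical.

```python
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def smo_train(points, C=1.0, tol=1e-3, max_passes=5, max_iter=2000, seed=0):
    """Simplified SMO (random second index, linear kernel); an assumption,
    not the paper's exact solver. points: list of (x_tuple, y), y in {-1, +1}.
    Returns (support_vectors, b) with support_vectors = [(x, y, alpha), ...]."""
    n = len(points)
    if n < 2:
        return [], 0.0
    rng = random.Random(seed)
    X = [p[0] for p in points]
    y = [p[1] for p in points]
    K = [[dot(a, c) for c in X] for a in X]   # precomputed kernel matrix
    alpha, b = [0.0] * n, 0.0

    def err(i):  # prediction error on training point i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b - y[i]

    passes = iters = 0
    while passes < max_passes and iters < max_iter:
        iters += 1
        changed = 0
        for i in range(n):
            Ei = err(i)
            violates = (y[i] * Ei < -tol and alpha[i] < C) or \
                       (y[i] * Ei > tol and alpha[i] > 0)
            if not violates:
                continue
            j = rng.choice([k for k in range(n) if k != i])
            Ej = err(j)
            ai, aj = alpha[i], alpha[j]
            if y[i] != y[j]:
                lo, hi = max(0.0, aj - ai), min(C, C + aj - ai)
            else:
                lo, hi = max(0.0, ai + aj - C), min(C, ai + aj)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if lo == hi or eta >= 0:
                continue
            aj_new = min(hi, max(lo, aj - y[j] * (Ei - Ej) / eta))
            if abs(aj_new - aj) < 1e-5:
                continue
            ai_new = ai + y[i] * y[j] * (aj - aj_new)
            b1 = b - Ei - y[i] * (ai_new - ai) * K[i][i] - y[j] * (aj_new - aj) * K[i][j]
            b2 = b - Ej - y[i] * (ai_new - ai) * K[i][j] - y[j] * (aj_new - aj) * K[j][j]
            alpha[i], alpha[j] = ai_new, aj_new
            b = b1 if 0 < ai_new < C else b2 if 0 < aj_new < C else (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    return [(X[k], y[k], alpha[k]) for k in range(n) if alpha[k] > 1e-8], b

def map_split(split, global_svs):
    """Map step: train on one split plus the current global SVs, emit local SVs."""
    svs, _ = smo_train(sorted(set(split) | global_svs))
    return {(x, label) for x, label, _ in svs}

def reduce_svs(sv_sets):
    """Reduce step: merge the SVs emitted by all mappers."""
    return set().union(*sv_sets)

def train_pdsvm(splits, max_rounds=10):
    """Iterate map/reduce until the merged SV set stops changing (assumed
    convergence test), then train the final model on the global SVs."""
    global_svs = set()
    for _ in range(max_rounds):
        merged = reduce_svs([map_split(s, global_svs) for s in splits])
        if merged == global_svs:
            break
        global_svs = merged
    return smo_train(sorted(global_svs))

def predict(model, x):
    svs, b = model
    score = sum(a * label * dot(sx, x) for sx, label, a in svs) + b
    return 1 if score >= 0 else -1

# Toy, linearly separable data (hypothetical), interleaved so each split has both classes.
pos = [(2, 2), (3, 3), (4, 2), (2, 4), (3, 5), (5, 3)]
data = []
for p in pos:
    data.append((p, 1))
    data.append(((-p[0], -p[1]), -1))
splits = [data[k::3] for k in range(3)]
model = train_pdsvm(splits)
```

In a real Hadoop job each `map_split` call would run on a separate node over its own HDFS block, and only the emitted SVs would be shuffled to the reducer, which is what keeps the per-node training sets small.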
