Data management becomes a complex task when hundreds of petabytes of data are gathered, stored, and processed on a daily basis, making efficient processing of this exponentially growing data essential. This paper discusses the processing of large volumes of data with the Support Vector Machine (SVM) algorithm, using techniques ranging from a single-node linear implementation to parallel processing on distributed frameworks such as Hadoop. The MapReduce component of Hadoop performs the parallelization and feeds data to SVMs, a machine learning algorithm applicable to classification and regression analysis. The paper also presents a detailed anatomy of the SVM algorithm and sets out a roadmap for implementing it in both linear and MapReduce fashion. The main objective is to explain in detail the steps involved in developing an SVM algorithm from scratch using standard linear and MapReduce techniques, and to conduct a performance analysis across the linear implementation of SVM, the SVM implementation on single-node Hadoop, the SVM implementation on a Hadoop cluster, and a proven tool such as R, gauging them with respect to the accuracy achieved, their processing speed on varying data sizes, and their capability to handle huge data volumes without failing.
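To make the parallelization idea concrete, the sketch below shows one common way a MapReduce-style parallel SVM can be organized: each map task trains a local SVM on its data partition and emits that partition's support vectors, and the reduce step pools the support vectors and retrains a global model on their union. This cascade-style merge is an illustrative assumption, not the paper's fixed design; the sketch assumes scikit-learn and NumPy in place of an actual Hadoop job, and all function names are hypothetical.

```python
# Minimal sketch of a map/reduce-style parallel SVM (cascade-merge variant).
# The combination strategy here is an assumption for illustration only.
import numpy as np
from sklearn.svm import SVC

def map_train(partition):
    """Map phase: fit a local SVM on one partition, emit its support vectors."""
    X, y = partition
    clf = SVC(kernel="linear").fit(X, y)
    sv = clf.support_  # indices of this partition's support vectors
    return X[sv], y[sv]

def reduce_merge(sv_pairs):
    """Reduce phase: pool support vectors from all mappers and retrain globally."""
    X = np.vstack([x for x, _ in sv_pairs])
    y = np.concatenate([t for _, t in sv_pairs])
    return SVC(kernel="linear").fit(X, y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy linearly separable data, split into 4 partitions (one per mapper).
    X = rng.normal(size=(400, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    partitions = [(X[i::4], y[i::4]) for i in range(4)]
    global_svm = reduce_merge([map_train(p) for p in partitions])
    print("training accuracy:", global_svm.score(X, y))
```

In an actual Hadoop deployment the partitions would be HDFS splits and the map/reduce functions would run as cluster tasks; the single-process version above only mirrors the data flow.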