Performance Analysis of Big Data Frameworks on Virtualized Clusters

Research on Big Data applications has become increasingly important for institutions and researchers worldwide. This trend is driven by the increasing use of systems and devices that generate massive amounts of electronic data each day. Conventional algorithms are considered less efficient at managing and processing large datasets. In Big Data computation, Hadoop and Apache Spark are two commonly used open source frameworks that typically run on physical clusters. Because running these frameworks on a physical cluster consumes more energy and is rigid to manage, in this research we evaluated their performance on virtualized clusters. Virtualization technology offers flexibility in cluster management by sharing resources among multiple instances. Our experiments show that, in general, Apache Spark is about 2–9 times better than Hadoop in execution time and throughput when both run in a virtualized environment.
