Performance Implications of SSDs in Virtualized Hadoop Clusters

Big data involves massive volumes of data for which traditional techniques are not effective. Apache Hadoop is currently one of the most popular software frameworks supporting big data analysis. As Hadoop clusters grow larger, building them in virtualized environments has drawn considerable attention. However, optimizing the performance of a Hadoop cluster in a virtualized environment is difficult because of the virtualization overhead. This paper identifies the performance implications of SSDs in virtualized Hadoop clusters and shows that SSDs minimize the virtualization overhead. The study reveals that the main virtualization overhead is an I/O bottleneck caused by a fragmented and randomized I/O workload, which virtualization aggravates. SSDs, however, tolerate this workload better than HDDs. As a result, the virtualization overhead with SSDs is much lower than with HDDs. Moreover, with SSDs the virtualized Hadoop cluster sustains good performance regardless of the number of VMs.
