Memory requirements of Hadoop, Spark, and MPI-based big data applications on commodity server-class architectures

Emerging big data frameworks require computational resources and memory subsystems that can naturally scale to manage massive amounts of diverse data. Given the large size and heterogeneity of the data, it is currently unclear whether big data frameworks such as Hadoop, Spark, and MPI will require high-performance, large-capacity memory to cope with this change, and exactly what role main memory subsystems will play, particularly in terms of energy efficiency. The primary purpose of this study is to answer these questions through empirical analysis of different memory configurations available on commodity hardware and to assess the impact of these configurations on the performance and power of these well-established frameworks. Our results reveal that while Hadoop places no major demand on high-end DRAM, iterative Spark and MPI tasks (e.g., machine learning) benefit from high-end DRAM, in particular high frequency and a larger number of channels. Among the configurable parameters, our results indicate that increasing the number of DRAM channels reduces DRAM power and improves energy efficiency across all three frameworks.
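
The study's methodology rests on measuring DRAM power while workloads run under different memory configurations. As an illustrative sketch only (not the authors' actual instrumentation), the snippet below shows one way DRAM energy and average power could be sampled on a commodity Linux server, assuming the Intel RAPL powercap sysfs interface exposes a "dram" sub-zone; the spark-submit invocation is a hypothetical stand-in for a real workload, and reading the counters typically requires root privileges.

```python
import subprocess
import time
from pathlib import Path

RAPL_ROOT = Path("/sys/class/powercap")

def dram_zones():
    """Find RAPL sub-zones named 'dram' (present on many server-class CPUs)."""
    zones = []
    for zone in RAPL_ROOT.glob("intel-rapl:*:*"):
        try:
            if (zone / "name").read_text().strip() == "dram":
                zones.append(zone)
        except OSError:
            pass
    return zones

def read_energy_uj(zones):
    """Sum the cumulative energy counters (microjoules) across DRAM zones."""
    return sum(int((z / "energy_uj").read_text()) for z in zones)

def measure(cmd):
    """Run a workload command and report DRAM energy (J) and average DRAM power (W)."""
    zones = dram_zones()
    if not zones:
        raise RuntimeError("no RAPL 'dram' zones found; interface may be unavailable")
    start_energy, start_time = read_energy_uj(zones), time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start_time
    energy_j = (read_energy_uj(zones) - start_energy) / 1e6  # ignores counter wrap-around
    return energy_j, energy_j / elapsed

if __name__ == "__main__":
    # Hypothetical workload: a Spark machine-learning job; substitute any command line.
    joules, watts = measure(["spark-submit", "--class", "Kmeans", "workload.jar"])
    print(f"DRAM energy: {joules:.1f} J, average DRAM power: {watts:.2f} W")
```

Repeating such a measurement with the machine configured for different DRAM frequencies and channel counts is the kind of comparison the abstract describes.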
