Spark deployment and performance evaluation on the MareNostrum supercomputer

In this paper we present a framework to enable data-intensive Spark workloads on MareNostrum, a petascale supercomputer designed mainly for compute-intensive applications. As far as we know, this is the first attempt to investigate optimized deployment configurations of Spark on a petascale HPC setup. We detail the design of the framework and present benchmark data that provide insights into the scalability of the system. We examine the impact of different configurations, including parallelism, storage, and networking alternatives, and we discuss several aspects of executing Big Data workloads on a computing system based on the compute-centric paradigm. Further, we derive conclusions that aim to pave the way towards systematic and optimized methodologies for fine-tuning data-intensive applications on large clusters, with an emphasis on parallelism configurations.
