Efficient data access strategies for Hadoop and Spark on HPC cluster with heterogeneous storage

Hadoop MapReduce and Spark are currently the most popular Big Data processing frameworks, and the Hadoop Distributed File System (HDFS) is their primary storage. Both frameworks launch tasks based on data locality. In the presence of heterogeneous storage devices, where different nodes have different storage characteristics, locality-aware data access alone cannot always guarantee optimal performance. Storage type also becomes important, especially when high-performance SSD and in-memory storage devices are available along with high-performance interconnects. In this paper, we therefore propose efficient data access strategies for Hadoop and Spark that consider both data locality and storage type, such as Greedy (which prioritizes storage type over locality) and Hybrid (which balances the load between local and high-performance storage). We re-design HDFS to accommodate the enhanced access strategies. Our evaluations show that the proposed data access strategies can improve the read performance of HDFS by up to 33% compared to the default locality-aware data access. The execution times of Hadoop Sort and Spark Sort are reduced by up to 32% and 17%, respectively. The performance of Hadoop and Spark TeraSort is also improved by up to 11% through our design.
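The difference between locality-first, Greedy, and Hybrid replica selection can be illustrated with a small sketch. This is not the re-designed HDFS code; the abstract does not specify the algorithms in detail, so the storage ranking, the `Replica` type, the function names, and the Hybrid rule (sending one out of every `fast_share` reads to the fastest storage) are all hypothetical assumptions chosen only to show the idea.

```python
from dataclasses import dataclass

# Lower rank means faster storage. This ordering is an illustrative assumption.
STORAGE_RANK = {"RAM_DISK": 0, "SSD": 1, "HDD": 2}


@dataclass(frozen=True)
class Replica:
    node: str      # host that stores this copy of the block
    storage: str   # key of STORAGE_RANK


def locality_first(replicas, client_node):
    """Locality-aware choice: prefer a replica on the client's own node."""
    return min(replicas, key=lambda r: r.node != client_node)


def greedy(replicas, client_node):
    """Greedy choice: fastest storage type first, locality only breaks ties."""
    return min(
        replicas,
        key=lambda r: (STORAGE_RANK[r.storage], r.node != client_node),
    )


def hybrid(replicas, client_node, request_index, fast_share=2):
    """Hypothetical balancing rule (an assumption, not the paper's policy):
    one out of every `fast_share` reads goes to the fastest storage,
    the remaining reads stay local."""
    if request_index % fast_share == 0:
        return greedy(replicas, client_node)
    return locality_first(replicas, client_node)


if __name__ == "__main__":
    block_replicas = [
        Replica("node1", "HDD"),
        Replica("node2", "SSD"),
        Replica("node3", "RAM_DISK"),
    ]
    print("locality:", locality_first(block_replicas, "node1").node)
    print("greedy:  ", greedy(block_replicas, "node1").node)
    for i in range(4):
        print("hybrid", i, hybrid(block_replicas, "node1", i).node)
```

In this sketch, a client on `node1` reads from its local HDD under the locality-first rule, from the RAM disk on `node3` under Greedy, and alternates between the two under Hybrid. A real policy would likely also account for current load and network distance.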