Customized Filesystem with Dynamic Stripe Strategies on Lustre-Based Hadoop

Large-scale data is growing so quickly that Hadoop, the traditional big data processing framework, has hit a bottleneck in its data storage layer. Running Hadoop on modern HPC clusters has attracted much attention because of Hadoop's data processing and analysis capabilities. Lustre is a parallel file system that has held a large share of the HPC storage market for many years, so a Lustre-based Hadoop platform presents both new opportunities and new challenges. In this paper, we build a Lustre-based Hadoop by implementing a custom LustreFileSystem class that inherits from the FileSystem class in the Hadoop source code. To make full use of Lustre's high performance, we also propose a novel dynamic stripe strategy that optimizes the stripe size while data is being written to the Lustre file system. Our results indicate that, compared with the original Hadoop, throughput (MB/s) improves by about 3x for writes and 11x for reads, and the average I/O rate (MB/s) improves by at least 3x. In addition, compared with existing Lustre-based Hadoop, our dynamic stripe strategy smooths read operations and slightly improves write performance.
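The abstract does not spell out how the dynamic stripe strategy picks a stripe size, so the following is only an illustrative sketch of what a size-aware choice could look like, not the method of the paper. The function names, the 1 MiB and 64 MiB bounds, and the "spread the file evenly over the stripes" heuristic are all assumptions made for this example. The only Lustre facts relied on are that a stripe size must be a multiple of 64 KiB and that `lfs setstripe` accepts `-S` (stripe size) and `-c` (stripe count).

```python
# Hypothetical sketch: choose a Lustre stripe size from the expected file size.
# The heuristic and the bounds below are assumptions, not the paper's strategy.

KIB = 1024
MIB = 1024 * KIB
STRIPE_ALIGN = 64 * KIB  # Lustre requires stripe sizes to be multiples of 64 KiB


def choose_stripe_size(file_size, stripe_count, lo=1 * MIB, hi=64 * MIB):
    """Return a stripe size in bytes for a file of `file_size` bytes.

    Assumed heuristic: give each of the `stripe_count` stripes an equal
    share of the file, round that share up to a 64 KiB multiple, then
    clamp it to the range [lo, hi].
    """
    if stripe_count < 1:
        raise ValueError("stripe_count must be at least 1")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    per_stripe = -(-file_size // stripe_count)            # ceiling division
    aligned = -(-per_stripe // STRIPE_ALIGN) * STRIPE_ALIGN  # round up to 64 KiB
    return max(lo, min(hi, aligned))


def setstripe_command(path, stripe_size, stripe_count):
    """Build (but do not run) an `lfs setstripe` command line as a list."""
    return ["lfs", "setstripe", "-S", str(stripe_size), "-c", str(stripe_count), path]


if __name__ == "__main__":
    size = choose_stripe_size(8 * MIB, stripe_count=4)
    print(size)
    print(" ".join(setstripe_command("/mnt/lustre/hadoop/part-00000", size, 4)))
```

In this sketch the layout is decided once, before the file is created, because a Lustre file's stripe layout is fixed at creation time; the path `/mnt/lustre/hadoop/part-00000` is a placeholder.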
