In-memory I/O and replication for HDFS with Memcached: Early experiences

Hadoop is the de facto standard platform for large-scale data analytics applications. Despite its high availability and reliability guarantees, the Hadoop Distributed File System (HDFS) suffers from severe I/O bottlenecks when storing tri-replicated data blocks. These I/O overheads, intrinsic to the HDFS architecture, degrade application performance. In this paper, we present a novel design (MEM-HDFS) that performs intelligent caching and replication of HDFS data blocks in Memcached and can significantly improve I/O performance. In this design, we consider different deployment strategies for the Memcached servers (local and remote) and guarantee persistence of the Memcached data to HDFS on cache replacements. Performance evaluations show that MEM-HDFS can increase the read and write throughput of HDFS by up to 3.9x and 3.3x, respectively. Our design can also significantly speed up the data loading (to HDFS) phase: it reduces the execution times of data generation benchmarks such as TeraGen, RandomTextWriter, and RandomWriter by up to 50%, 39%, and 48%, respectively. The performance of other benchmarks, such as TeraSort and Grep, is also improved by the proposed design.
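
The idea of guaranteeing persistence on cache replacement can be illustrated with a minimal sketch. This is not the MEM-HDFS implementation: the abstract does not state the replacement policy, so LRU is assumed here, and the class name `WriteBackBlockCache`, the in-process `OrderedDict` standing in for a Memcached server, and the plain `dict` standing in for HDFS on-disk storage are all hypothetical.

```python
from collections import OrderedDict


class WriteBackBlockCache:
    """Toy LRU block cache that persists a block when it is evicted.

    Assumptions (not from the paper): LRU replacement, a dict-like
    backing store in place of HDFS, and an in-process OrderedDict in
    place of a Memcached server.
    """

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store
        self.blocks = OrderedDict()  # least recently used entry first

    def write(self, block_id, data):
        # Writes land in memory first and are marked most recently used.
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        # On replacement, the evicted block is persisted before it is dropped.
        while len(self.blocks) > self.capacity:
            evicted_id, evicted_data = self.blocks.popitem(last=False)
            self.backing_store[evicted_id] = evicted_data

    def read(self, block_id):
        # Cache hit: serve from memory and refresh recency.
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        # Cache miss: fall back to the backing store and re-cache the block.
        data = self.backing_store[block_id]
        self.write(block_id, data)
        return data
```

With a capacity of two blocks, writing a third block evicts the oldest one, which is then readable only because it was persisted to the backing store at eviction time.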
