Optimizing Hadoop Framework for Solid State Drives

Solid state drives (SSDs) have been widely used in Hadoop clusters ever since their introduction to the big data industry. However, the current Hadoop framework is not optimized to take full advantage of SSDs. In this paper, we introduce architectural improvements to the core Hadoop components that exploit the performance benefits of SSDs for data- and compute-intensive workloads. The improved architecture features four changes: a simplified data handling algorithm that uses the high random IOPS of SSDs to store and shuffle the map output data; an accurate pre-read model for HDFS, based on libaio, that reduces read latency and improves request parallelism; a record-size-based reduce scheduler that mitigates the data skew problem in the reduce phase; and a new HDFS block placement policy that uses disk wear information to manage SSD lifetime. The simplified map output collector and the HDFS pre-read model improve performance by 30% and 18% on the Terasort and DFSIO benchmarks, respectively. The modified reduce scheduler yields a 12% faster execution time with a real MapReduce application. Extending these results, we confirm that the modified structure also achieves a 21% performance improvement on Samsung's MicroBrick-based hyperscale system.
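To make the idea of size-aware reduce scheduling concrete, the following is a minimal sketch and not the scheduler described in this paper. It assumes that the size of each map output partition is known before reduce tasks are assigned, and it uses a generic greedy heuristic (largest partition first, onto the least-loaded reducer). The names `assign_partitions`, `reducer_loads`, and the sample sizes are hypothetical and chosen only for illustration.

```python
import heapq


def assign_partitions(partition_sizes, num_reducers):
    """Greedy size-aware assignment (illustrative assumption, not the paper's method).

    partition_sizes: dict mapping a partition id to its size (e.g. bytes).
    Returns a dict mapping reducer index to the list of partition ids it receives.
    """
    # Min-heap of (current load, reducer index); the least-loaded reducer is popped first.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}

    # Place the largest partitions first so that small ones can even out the loads.
    by_size = sorted(partition_sizes.items(), key=lambda kv: kv[1], reverse=True)
    for pid, size in by_size:
        load, r = heapq.heappop(heap)
        assignment[r].append(pid)
        heapq.heappush(heap, (load + size, r))
    return assignment


def reducer_loads(assignment, partition_sizes):
    """Total size assigned to each reducer."""
    return {r: sum(partition_sizes[p] for p in pids) for r, pids in assignment.items()}


# Hypothetical skewed partition sizes: one partition dominates the others.
sizes = {"p0": 90, "p1": 10, "p2": 40, "p3": 40, "p4": 20}
plan = assign_partitions(sizes, num_reducers=2)
loads = reducer_loads(plan, sizes)
```

With these sample sizes, a scheduler that ignores size and simply splits the partition list in half could give one reducer far more data than the other, while the size-aware plan keeps the two loads close. How the actual scheduler obtains and uses record sizes is not specified in the abstract.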
