Improving Shuffle I/O performance for big data processing using hybrid storage

Big data analytics is now widely used in many domains, such as weather forecasting, social network analysis, scientific computing, and bioinformatics. As an indispensable part of big data analytics, MapReduce has become the de facto standard model for distributed computing frameworks. With the growing complexity of software and hardware components, big data analytics systems face performance bottlenecks as computing workloads grow. In our study, we reveal that the Shuffle mechanism in the current Spark implementation remains a performance bottleneck due to Shuffle I/O latency, and we demonstrate that the Shuffle stage causes performance degradation in MapReduce jobs. Observing that high-end solid state drives (SSDs) handle random writes well, owing to efficient flash translation layer algorithms and larger on-board I/O caches, we present a hybrid storage system-based solution that uses hard disk drives (HDDs) to store large datasets and SSDs to improve Shuffle I/O performance, mitigating this degradation. Our extensive experiments using both real-world and synthetic workloads show that the hybrid storage approach improves Shuffle-stage performance compared with the original HDD-based Spark implementation.
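A hybrid layout of this kind can be approximated in a stock Spark-on-HDFS deployment by directing Shuffle map outputs and spill files to SSD-backed directories while keeping HDFS block storage on HDDs. The sketch below uses the standard configuration keys `spark.local.dir` (Spark's scratch space for Shuffle data) and `dfs.datanode.data.dir` (HDFS block storage); the mount points `/mnt/ssd1`, `/mnt/ssd2`, `/data/hdd1`, and `/data/hdd2` are illustrative assumptions, not part of the original paper.

```
# spark-defaults.conf — place Shuffle intermediate data on SSDs
# (mount points are assumptions; adjust to the actual cluster)
spark.local.dir    /mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp

# hdfs-site.xml — keep large persistent datasets on HDDs
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/data/hdd1/dfs/data,/data/hdd2/dfs/data</value>
</property>
```

Listing multiple SSD directories in `spark.local.dir` also lets Spark round-robin Shuffle files across devices, spreading random-write traffic over the available flash bandwidth.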
