Exploring Benefits of NVMe SSDs for BigData Processing in Enterprise Data Centers

Big data processing environments such as Apache Spark are widely deployed for applications with large-scale workloads. New storage technologies such as Non-Volatile Memory Express Solid State Drives (NVMe SSDs) provide higher throughput than traditional Hard Disk Drives (HDDs), and NVMe SSDs are therefore rapidly replacing HDDs in modern data centers. In this paper, we explore whether an NVMe SSD is actually necessary for a large workload running on the Spark big data framework. Specifically, we investigate which factors in application design and in the Spark data processing framework determine whether the benefits of NVMe SSDs can be exploited. Our experimental results reveal that some applications, even with large workloads, cannot fully utilize NVMe SSDs to obtain high I/O throughput. Interestingly, we find that characteristics of the Spark data processing framework such as shuffling (i.e., the volume of intermediate data generated by an application) and parallelism (i.e., the number of concurrently running tasks) have a crucial impact on the performance of big data applications running on NVMe SSDs.
