Paper information - Falcon: Scaling IO Performance in Multi-SSD Volumes

With the high throughput offered by solid-state drives (SSDs), multi-SSD volumes have become an attractive storage solution for big data applications. Unfortunately, the IO stack in current operating systems imposes a number of volume-level limitations, such as per-volume IO processing in the block layer, a single flush thread per volume for buffer cache management, and locks for parallel IOs on a file, all of which lower the performance that could otherwise be achieved on multi-SSD volumes. To address this problem, we propose a new design of per-drive IO processing that separates two key functionalities in the IO stack: IO batching and IO serving. Specifically, we design and develop Falcon, which consists of two major components: the Falcon IO Management Layer, which batches incoming IOs at the volume level, and the Falcon Block Layer, which parallelizes IO serving at the SSD level in a new block layer. Compared to current practice, Falcon speeds up direct random file read and write on an 8-SSD volume by 1.77× and 1.59× respectively, and also shows strong scalability across different numbers of drives and various storage controllers. In addition, Falcon improves the performance of a variety of applications by 1.69×.

H. Howie Huang | Pradeep Kumar

[1] Bryan Veal,et al. Towards SSD-Ready Enterprise Platforms , 2010, ADMS@VLDB.

[2] Peter J. Varman,et al. Balancing fairness and efficiency in tiered storage systems with bottleneck-aware allocation , 2014, FAST.

[3] Francesco De Pellegrini,et al. Distributed k-Core Decomposition , 2013 .

[4] Steven Swanson,et al. DC express: shortest latency protocol for reading phase change memory over PCI express , 2014, FAST.

[5] Jin-Soo Kim,et al. NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs , 2016, HotStorage.

[6] Guy E. Blelloch,et al. GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[7] Hyeonsang Eom,et al. Exploiting Peak Device Throughput from Random Access Workload , 2012, HotStorage.

[8] Wenguang Chen,et al. GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning , 2015, USENIX ATC.

[9] David J. Lilja,et al. High performance solid state storage under Linux , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[10] Alexander S. Szalay,et al. Toward millions of file system IOPS on low-cost, commodity hardware , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11] Youyou Lu,et al. ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices , 2016, USENIX Annual Technical Conference.

[12] Frank Hady,et al. When poll is better than interrupt , 2012, FAST.

[13] Tianyu Wo,et al. SpanFS: A Scalable File System on Fast Storage Devices , 2015, USENIX Annual Technical Conference.

[14] Rong Chen,et al. PowerLyra: differentiated graph computation and partitioning on skewed graphs , 2015, EuroSys.

[15] Uzi Vishkin,et al. An O(log n) Parallel Connectivity Algorithm , 1982, J. Algorithms.

[16] Andrea C. Arpaci-Dusseau,et al. WiscKey: Separating Keys from Values in SSD-conscious Storage , 2016, FAST.

[17] H. Howie Huang,et al. G-Store: High-Performance Graph Store for Trillion-Edge Processing , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] Kun-Lung Wu,et al. Streaming Algorithms for k-core Decomposition , 2013, Proc. VLDB Endow..

[19] Heon Young Yeom,et al. Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory , 2013, HotStorage.

[20] Steven Swanson,et al. Providing safe, user space access to fast, solid state disks , 2012, ASPLOS XVII.

[21] Bianca Schroeder,et al. sRoute: Treating the Storage Stack Like a Network , 2016, FAST.

[22] Mohan Kumar,et al. Mosaic: Processing a Trillion-Edge Graph on a Single Machine , 2017, EuroSys.

[23] Bianca Schroeder,et al. Treating the Storage Stack Like a Network , 2017, ACM Trans. Storage.

[24] Giri Narasimhan,et al. CacheDedup: In-line Deduplication for Flash Caching , 2016, FAST.

[25] Uzi Vishkin,et al. An O(n² log n) Parallel MAX-FLOW Algorithm , 1982, J. Algorithms.

[26] Changwoo Min,et al. Understanding Manycore Scalability of File Systems , 2016, USENIX Annual Technical Conference.

[27] Raju Rangaswami,et al. I/O Deduplication: Utilizing content similarity to improve I/O performance , 2010, TOS.

[28] Alexander S. Szalay,et al. FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs , 2014, FAST.

[29] H. Howie Huang,et al. Graphene: Fine-Grained IO Management for Graph Computing , 2017, FAST.

[30] Rajiv Gupta,et al. Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing , 2016, USENIX Annual Technical Conference.

[31] Raju Rangaswami,et al. Non-blocking Writes to Files , 2015, FAST.

[32] Xiaoqian Jiang,et al. Fast and Robust Parallel SGD Matrix Factorization , 2015, KDD.

[33] 신웅. OS I/O path optimizations for flash solid-state drives , 2017 .

[34] Andrea C. Arpaci-Dusseau,et al. Bridging the Information Gap in Storage Protocol Stacks , 2002, USENIX Annual Technical Conference.

[35] Sergey Brin,et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[36] Sun Zhen,et al. Using Hints to Improve Inline Block-layer Deduplication , 2016, FAST.

[37] Willy Zwaenepoel,et al. Chaos: scale-out graph processing from secondary storage , 2015, SOSP.

[38] Philippe Bonnet,et al. Linux block IO: introducing multi-queue SSD access on multi-core systems , 2013, SYSTOR '13.

[39] Rajesh K. Gupta,et al. Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[40] Erez Zadok,et al. vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O , 2017, FAST.

[41] Hyeonsang Eom,et al. Optimizing the Block I/O Subsystem for Fast Storage Devices , 2014, ACM Trans. Comput. Syst..

[42] Willy Zwaenepoel,et al. X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.