Falcon: Scaling IO Performance in Multi-SSD Volumes

With the high throughput offered by solid-state drives (SSDs), multi-SSD volumes have become an attractive storage solution for big data applications. Unfortunately, the IO stack in current operating systems imposes a number of volume-level limitations, such as per-volume based IO processing in the block layer, single flush thread per volume for buffer cache management, locks for parallel IOs on a file, all of which lower the performance that could otherwise be achieved on multi-SSD volumes. To address this problem, we propose a new design of per-drive IO processing that separates two key functionalities of IO batching and IO serving in the IO stack. Specifically, we design and develop Falcon1 that consists of two major components: Falcon IO Management Layer that batches the incoming IOs at the volume level, and Falcon Block Layer that parallelizes IO serving on the SSD level in a new block layer. Compared to the current practice, Falcon significantly speeds up direct random file read and write on an 8-SSD volume by 1.77× and 1.59× respectively, and also shows strong scalability across different numbers of drives and various storage controllers. In addition, Falcon improves the performance of a variety of applications by 1.69×.

[1]  Bryan Veal,et al.  Towards SSD-Ready Enterprise Platforms , 2010, ADMS@VLDB.

[2]  Peter J. Varman,et al.  Balancing fairness and efficiency in tiered storage systems with bottleneck-aware allocation , 2014, FAST.

[3]  Francesco De Pellegrini,et al.  Distributed k-Core Decomposition , 2013 .

[4]  Steven Swanson,et al.  DC express: shortest latency protocol for reading phase change memory over PCI express , 2014, FAST.

[5]  Jin-Soo Kim,et al.  NVMeDirect: A User-space I/O Framework for Application-specific Optimization on NVMe SSDs , 2016, HotStorage.

[6]  Guy E. Blelloch,et al.  GraphChi: Large-Scale Graph Computation on Just a PC , 2012, OSDI.

[7]  Hyeonsang Eom,et al.  Exploiting Peak Device Throughput from Random Access Workload , 2012, HotStorage.

[8]  Wenguang Chen,et al.  GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning , 2015, USENIX ATC.

[9]  David J. Lilja,et al.  High performance solid state storage under Linux , 2010, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST).

[10]  Alexander S. Szalay,et al.  Toward millions of file system IOPS on low-cost, commodity hardware , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[11]  Youyou Lu,et al.  ParaFS: A Log-Structured File System to Exploit the Internal Parallelism of Flash Devices , 2016, USENIX Annual Technical Conference.

[12]  Frank Hady,et al.  When poll is better than interrupt , 2012, FAST.

[13]  Tianyu Wo,et al.  SpanFS: A Scalable File System on Fast Storage Devices , 2015, USENIX Annual Technical Conference.

[14]  Rong Chen,et al.  PowerLyra: differentiated graph computation and partitioning on skewed graphs , 2015, EuroSys.

[15]  Uzi Vishkin,et al.  An O(log n) Parallel Connectivity Algorithm , 1982, J. Algorithms.

[16]  Andrea C. Arpaci-Dusseau,et al.  WiscKey: Separating Keys from Values in SSD-conscious Storage , 2016, FAST.

[17]  H. Howie Huang,et al.  G-Store: High-Performance Graph Store for Trillion-Edge Processing , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[18]  Kun-Lung Wu,et al.  Streaming Algorithms for k-core Decomposition , 2013, Proc. VLDB Endow..

[19]  Heon Young Yeom,et al.  Dynamic Interval Polling and Pipelined Post I/O Processing for Low-Latency Storage Class Memory , 2013, HotStorage.

[20]  Steven Swanson,et al.  Providing safe, user space access to fast, solid state disks , 2012, ASPLOS XVII.

[21]  Bianca Schroeder,et al.  sRoute: Treating the Storage Stack Like a Network , 2016, FAST.

[22]  Mohan Kumar,et al.  Mosaic: Processing a Trillion-Edge Graph on a Single Machine , 2017, EuroSys.

[23]  Bianca Schroeder,et al.  Treating the Storage Stack Like a Network , 2017, ACM Trans. Storage.

[24]  Giri Narasimhan,et al.  CacheDedup: In-line Deduplication for Flash Caching , 2016, FAST.

[25]  Uzi Vishkin,et al.  An O(n² log n) Parallel MAX-FLOW Algorithm , 1982, J. Algorithms.

[26]  Changwoo Min,et al.  Understanding Manycore Scalability of File Systems , 2016, USENIX Annual Technical Conference.

[27]  Raju Rangaswami,et al.  I/O Deduplication: Utilizing content similarity to improve I/O performance , 2010, TOS.

[28]  Alexander S. Szalay,et al.  FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs , 2014, FAST.

[29]  H. Howie Huang,et al.  Graphene: Fine-Grained IO Management for Graph Computing , 2017, FAST.

[30]  Rajiv Gupta,et al.  Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing , 2016, USENIX Annual Technical Conference.

[31]  Raju Rangaswami,et al.  Non-blocking Writes to Files , 2015, FAST.

[32]  Xiaoqian Jiang,et al.  Fast and Robust Parallel SGD Matrix Factorization , 2015, KDD.

[33]  신웅 OS I/O path optimizations for flash solid-state drives , 2017 .

[34]  Andrea C. Arpaci-Dusseau,et al.  Proceedings of the 2002 Usenix Annual Technical Conference Bridging the Information Gap in Storage Protocol Stacks , 2022 .

[35]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[36]  Sun Zhen,et al.  Using Hints to Improve Inline Block-layer Deduplication , 2016, FAST.

[37]  Willy Zwaenepoel,et al.  Chaos: scale-out graph processing from secondary storage , 2015, SOSP.

[38]  Philippe Bonnet,et al.  Linux block IO: introducing multi-queue SSD access on multi-core systems , 2013, SYSTOR '13.

[39]  Rajesh K. Gupta,et al.  Moneta: A High-Performance Storage Array Architecture for Next-Generation, Non-volatile Memories , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[40]  Erez Zadok,et al.  vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O , 2017, FAST.

[41]  Hyeonsang Eom,et al.  Optimizing the Block I/O Subsystem for Fast Storage Devices , 2014, ACM Trans. Comput. Syst..

[42]  Willy Zwaenepoel,et al.  X-Stream: edge-centric graph processing using streaming partitions , 2013, SOSP.