Optimizing the Block I/O Subsystem for Fast Storage Devices

Fast storage devices are an emerging solution for data-intensive applications. They provide high transaction rates for DBMSs, low response times for Web servers, instant on-demand paging for applications with large memory footprints, and similar advantages for other performance-hungry applications. Despite the benefits promised by fast hardware, modern operating systems are not yet structured to take advantage of the hardware’s full potential. The software overhead introduced by the OS, negligible in the past, now adversely impacts application performance and lessens the advantage of using such hardware. Our analysis demonstrates that the overheads of the traditional storage-stack design are significant and cannot easily be overcome without modifying the hardware interface and adding new capabilities to the operating system. In this article, we propose six optimizations that enable an OS to fully exploit the performance characteristics of fast storage devices. With the support of new hardware interfaces, our optimizations minimize per-request latency by streamlining the I/O path and amortize per-request latency by maximizing parallelism inside the device. We demonstrate the impact on application performance through well-known storage benchmarks run against a Linux kernel with a customized SSD. We find that eliminating context switches in the I/O path decreases the software overhead of an I/O request from 20 microseconds to 5 microseconds, and that a new request merge scheme called Temporal Merge enables the OS to achieve 87% to 100% of peak device performance, regardless of request access patterns or types. Although the performance improvement from these optimizations on a standard SATA-based SSD is marginal (because of its limited interface and relatively high response times), our sensitivity analysis suggests that future SSDs with lower response times will benefit from these changes. The effectiveness of our optimizations encourages discussion between the OS community and storage vendors about future device interfaces for fast storage devices.
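To make the sensitivity argument concrete, the following minimal sketch estimates what share of a request's end-to-end latency is spent in software as the device gets faster. It assumes a simple serial model in which total latency is software overhead plus device response time; that model, the function names, and the three device response times are illustrative assumptions, not figures or methods from the article. Only the 20 and 5 microsecond overheads come from the text above.

```python
def software_share(overhead_us: float, device_us: float) -> float:
    """Fraction of end-to-end latency spent in software.

    Assumes total latency = software overhead + device response time
    (a simplifying assumption, not the article's measurement model).
    """
    return overhead_us / (overhead_us + device_us)


# Hypothetical device response times in microseconds (not from the article):
# a slow device, a faster one, and a very fast one.
DEVICE_LATENCIES_US = [5000.0, 100.0, 10.0]


def compare(before_us: float = 20.0, after_us: float = 5.0):
    """Return (device latency, share before, share after) for each device.

    The defaults are the software overheads reported in the abstract,
    before and after eliminating context switches in the I/O path.
    """
    return [
        (d, software_share(before_us, d), software_share(after_us, d))
        for d in DEVICE_LATENCIES_US
    ]


if __name__ == "__main__":
    for device_us, before, after in compare():
        print(f"{device_us:8.0f} us device: {before:6.1%} -> {after:6.1%}")
```

Under this model, the software share stays below one percent for the slowest hypothetical device, which is consistent with the marginal gains reported on a SATA-based SSD, and it grows to a large fraction of total latency as the device response time approaches the overhead itself.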
