FlashShare: Punching Through Server Storage Stack from Kernel to Firmware for Ultra-Low Latency SSDs

A modern datacenter server aims to achieve high energy efficiency by co-running multiple applications. Some of these applications (e.g., web search) are latency-sensitive and therefore require low-latency I/O services to respond quickly to client requests. However, we observe that simply replacing a server's storage devices with Ultra-Low-Latency (ULL) SSDs does not notably reduce I/O service latency, especially when multiple applications are co-running. In this paper, we propose FLASHSHARE, which helps ULL SSDs satisfy the different I/O latency requirements of different co-running applications. Specifically, FLASHSHARE is a holistic cross-stack approach that can significantly reduce I/O interference among co-running applications on a server without any change to the applications. At the kernel level, we extend the data structures of the storage stack to pass attributes of co-running applications through all layers of the underlying storage stack, from the OS kernel to the SSD firmware. Based on these attributes, the block layer and NVMe driver of FLASHSHARE manage the I/O scheduler and the NVMe interrupt handler differently for each application. At the SSD firmware level, we also enhance the NVMe controller and cache layer by dynamically partitioning the DRAM in the ULL SSD and adjusting its caching strategies to meet diverse user requirements. The evaluation results demonstrate that FLASHSHARE can shorten the average and 99th-percentile turnaround response times of co-running applications by 22% and 31%, respectively.
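As a rough illustration of the idea of carrying an application attribute along with each I/O request, the following minimal Python sketch tags requests and dispatches latency-critical ones first. It is not FLASHSHARE's kernel or firmware code: the attribute values `LATENCY_CRITICAL` and `THROUGHPUT`, the `IORequest` fields, and the `AttributeAwareQueue` class are hypothetical names chosen for this example, and a simple priority queue stands in for the actual scheduling policy.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

# Hypothetical attribute values (assumption): lower value is dispatched first.
LATENCY_CRITICAL = 0
THROUGHPUT = 1


@dataclass(order=True)
class IORequest:
    """An I/O request that carries its issuing application's attribute."""
    priority: int                      # attribute propagated with the request
    seq: int                           # arrival order, breaks ties FIFO
    app: str = field(compare=False)
    lba: int = field(compare=False)


class AttributeAwareQueue:
    """Toy dispatch queue: latency-critical requests go first, FIFO within a class."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def submit(self, app, lba, attribute):
        heapq.heappush(self._heap, IORequest(attribute, next(self._seq), app, lba))

    def dispatch(self):
        # Return the next request to serve, or None if the queue is empty.
        return heapq.heappop(self._heap) if self._heap else None


def drain(queue):
    """Dispatch everything and return the (app, lba) service order."""
    order = []
    while (req := queue.dispatch()) is not None:
        order.append((req.app, req.lba))
    return order
```

In this sketch, a request submitted later by a latency-critical application is served ahead of queued throughput-oriented requests, while requests of the same class keep their arrival order.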
