NVMM-Oriented Hierarchical Persistent Client Caching for Lustre

In high-performance computing (HPC), data and metadata are stored on dedicated server nodes, and client applications access them over a network, which introduces network latency and resource contention. These server nodes are typically equipped with (slow) magnetic disks, while the client nodes store temporary data on fast SSDs or even on non-volatile main memory (NVMM). The full potential of parallel file systems can therefore only be reached if fast client-side storage devices are integrated into the overall storage architecture. In this article, we propose an NVMM-based hierarchical persistent client cache for the Lustre file system (NVMM-LPCC for short). NVMM-LPCC implements two caching modes: a read-write mode (RW-NVMM-LPCC for short) and a read-only mode (RO-NVMM-LPCC for short). NVMM-LPCC integrates with the Lustre Hierarchical Storage Management (HSM) solution and the Lustre layout lock mechanism to provide consistent persistent caching services for I/O applications running on client nodes, while maintaining a global unified namespace across the entire Lustre file system. The evaluation results presented in this article show that NVMM-LPCC increases the average read throughput by up to 35.80 times and the average write throughput by up to 9.83 times compared with the native Lustre system, while providing excellent scalability.
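To illustrate the general idea of a hierarchical persistent client cache, the sketch below models a simplified read path in user-space C: data is served from a local NVMM-backed cache file when present, and otherwise fetched from the global Lustre namespace. This is a minimal, hypothetical sketch only; it is not the actual NVMM-LPCC implementation (which operates inside the Lustre client and coordinates with HSM and layout locks), and the mount points and helper names are assumptions.

```c
/*
 * Minimal sketch of a hierarchical persistent client-cache read path.
 * NOT the actual NVMM-LPCC kernel code; all paths and names are
 * hypothetical and chosen only for illustration.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NVMM_CACHE_DIR "/mnt/nvmm-cache"   /* hypothetical NVMM-backed cache mount */
#define LUSTRE_DIR     "/mnt/lustre"       /* hypothetical Lustre client mount     */

/* Read up to `len` bytes of the Lustre-relative file `rel` at offset
 * `off`, preferring the local NVMM cache over the remote servers. */
static ssize_t cached_read(const char *rel, void *buf, size_t len, off_t off)
{
    char path[4096];

    /* First try the local NVMM cache copy (cache hit). */
    snprintf(path, sizeof(path), "%s/%s", NVMM_CACHE_DIR, rel);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        /* Cache miss: fall back to the global Lustre namespace.
         * A real read-only caching mode would also fetch the file
         * into NVMM here so later reads are served locally. */
        snprintf(path, sizeof(path), "%s/%s", LUSTRE_DIR, rel);
        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
    }

    ssize_t n = pread(fd, buf, len, off);
    close(fd);
    return n;
}

int main(void)
{
    char buf[4096];
    ssize_t n = cached_read("data/input.bin", buf, sizeof(buf), 0);
    printf("read %zd bytes\n", n);
    return n < 0 ? EXIT_FAILURE : EXIT_SUCCESS;
}
```

In the actual system, consistency between the cache copy and the global namespace is not handled by path probing as above but by the Lustre layout lock mechanism, which revokes a client's cached layout when another client needs access.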
