Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage

A surprising development in recently announced HPC platforms is the addition of persistent (nonvolatile) memory (NVM), sometimes in massive amounts, to increase memory capacity and compensate for plateauing I/O capabilities. However, no portable and scalable programming interfaces use aggregate NVM effectively. This paper introduces Papyrus, a new software system built to exploit the emerging capabilities of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus consists mainly of the Papyrus Virtual File System (VFS) and the Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes, without requiring the user to handle complex communication, synchronization, replication, and consistency models. We evaluate Papyrus on two HPC systems, UTK Beacon and NERSC Cori, using real NVM storage devices.
