Exploring the future of out-of-core computing with compute-local non-volatile memory

Much as general-purpose graphics processing units (GPGPUs) rose to prominence as accelerators for specific high-performance computing (HPC) workloads, non-volatile memory (NVM) is increasingly being used as an accelerator for I/O-intensive scientific applications. However, existing work has explored the use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to outpace point-to-point network capacity, we argue for a break from the archetype of completely separated storage. In this work, we therefore investigate the co-location of NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks inherent in these levels of the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world out-of-core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.
