Paper information - Exploring the future of out-of-core computing with compute-local non-volatile memory

Just as general-purpose graphics processing units (GPGPUs) rose to prominence as accelerators for specific high-performance computing (HPC) workloads, non-volatile memory (NVM) is increasingly used as an accelerator for I/O-intensive scientific applications. However, existing work has explored the use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to outpace point-to-point network capacity, we argue for the need to break from the archetype of completely separated storage. Therefore, in this work we investigate co-location of NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks implicit in these levels of the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world Out-of-Core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.

[1] Michael L. Norman,et al. Accelerating data-intensive science with Gordon and Dash, 2010.

[2] Chao Yang,et al. Topology-Aware Mappings for Large-Scale Eigenvalue Problems , 2012, Euro-Par.

[3] Steve Byan,et al. Mercury: Host-side flash caching for the data center , 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[4] Jaechun No. NAND flash memory-based hybrid file system for high I/O performance , 2012, J. Parallel Distributed Comput.

[5] Michael M. Swift,et al. FlashTier: a lightweight, consistent and durable storage cache , 2012, EuroSys '12.

[6] Robert A. van de Geijn,et al. Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems , 2009, 2009 Eighth International Symposium on Parallel and Distributed Computing.

[7] Zheng Zhou,et al. An Out-of-Core Dataflow Middleware to Reduce the Cost of Large Scale Iterative Solvers , 2012, 2012 41st International Conference on Parallel Processing Workshops.

[8] Kurt Mehlhorn,et al. External-Memory Breadth-First Search with Sublinear I/O , 2002, ESA.

[9] Mahmut T. Kandemir,et al. Challenges in Getting Flash Drives Closer to CPU , 2013, HotStorage.

[10] Yuan Xie,et al. Hybrid checkpointing using emerging nonvolatile memories for future exascale systems , 2011, TACO.

[11] David Flynn,et al. DFS: A file system for virtualized flash storage , 2010, TOS.

[12] David J. DeWitt,et al. Turbocharging DBMS buffer pool using SSDs , 2011, SIGMOD '11.

[13] Joel H. Saltz,et al. Distributed processing of very large datasets with DataCutter , 2001, Parallel Comput.

[14] Satoshi Matsuoka. Making TSUBAME2.0, the world's greenest production supercomputer, even greener — Challenges to the architects , 2011, IEEE/ACM International Symposium on Low Power Electronics and Design.

[15] Jin-Soo Kim,et al. FlashLight , 2012, ACM Trans. Embed. Comput. Syst.

[16] Stephen C. Tweedie,et al. Journaling the Linux ext2fs Filesystem, 2008.

[17] Sandeep K. S. Gupta,et al. DASH: a Recipe for a Flash-based Data Intensive Supercomputer , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[18] Wei Hu,et al. Scalability in the XFS File System , 1996, USENIX Annual Technical Conference.

[19] Chao Yang,et al. Improving the scalability of a symmetric iterative eigensolver for multi‐core platforms , 2014, Concurr. Comput. Pract. Exp.

[20] Eva Hocks,et al. Gordon: design, performance, and experiences deploying and supporting a data intensive supercomputer , 2012, XSEDE '12.

[21] Mahmut T. Kandemir,et al. Revisiting widely held SSD expectations and rethinking system-level implications , 2013, SIGMETRICS '13.

[22] John Shalf,et al. NANDFlashSim: Intrinsic latency variation aware NAND flash memory system modeling and simulation at microarchitecture level , 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).

[23] Sivan Toledo,et al. A survey of out-of-core algorithms in numerical linear algebra , 1999, External Memory Algorithms.

[24] Xiaodong Zhang,et al. Essential roles of exploiting internal parallelism of flash memory based solid state drives in high-speed data processing , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[25] Mahmut T. Kandemir,et al. An Evaluation of Different Page Allocation Strategies on High-Speed SSDs , 2012, HotStorage.

[26] Chao Wang,et al. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[27] L. Vivier,et al. The new ext4 filesystem: current status and future plans, 2007.

[28] Mahmut T. Kandemir,et al. Physically addressed queueing (PAQ): Improving parallelism in solid state disks , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[29] Ming Zhao,et al. Write policies for host-side flash caches , 2013, FAST.

[30] Frank B. Schmuck,et al. GPFS: A Shared-Disk File System for Large Computing Clusters , 2002, FAST.

[31] Andrew V. Knyazev,et al. Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method , 2001, SIAM J. Sci. Comput.

[32] Jaemin Jung,et al. FRASH: Hierarchical File System for FRAM and Flash , 2007, ICCSA.

[33] Kai Shen,et al. A performance evaluation of scientific I/O workloads on Flash-based SSDs , 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.

[34] Torsten Suel,et al. Local methods for estimating pagerank values , 2004, CIKM '04.

[35] Onur Mutlu,et al. Architecting phase change memory as a scalable dram alternative , 2009, ISCA '09.