MLBS: Transparent Data Caching in Hierarchical Storage for Out-of-Core HPC Applications

Out-of-core simulation systems produce and/or consume massive amounts of data that cannot fit in the memory of a single compute node and that usually must be read and written back and forth during computation. I/O data movement may thus become a bottleneck in large-scale simulations. To increase I/O bandwidth, high-end supercomputers are equipped with hierarchical storage subsystems, such as node-local and remote-shared NVMe- and SSD-based burst buffers. Advanced caching systems have recently been developed to exploit the multi-layered nature of this new storage hierarchy; these software layers deliver more efficient data accesses, but at the cost of reduced computational kernel performance and a limited number of applications that can use the additional storage layers simultaneously. We introduce MultiLayered Buffer Storage (MLBS), a data object container that provides novel methods for caching and prefetching data in out-of-core scientific applications, performing expensive I/O operations asynchronously on systems equipped with hierarchical storage. The main idea is to decouple I/O operations from computational phases by using dedicated hardware resources to perform the expensive context switches. MLBS monitors I/O traffic in each storage layer, allowing fair utilization of shared resources while controlling the impact on kernel performance. By continually prefetching up and down across all hardware layers of the memory/storage subsystem, MLBS transforms the original I/O-bound behavior of the evaluated applications and shifts it closer to a memory-bound regime. Our evaluation on a Cray XC40 system with a representative I/O-bound application, seismic inversion, shows that MLBS outperforms state-of-the-art storage solutions, namely Lustre, Data Elevator, and DataWarp, by 6.06X, 2.23X, and 1.90X, respectively.
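To make the decoupling idea concrete, the following C++ sketch shows one way a layered-buffer container can hand prefetch requests to a dedicated I/O thread so that staging from a slow storage tier overlaps with computation. This is a minimal illustration under stated assumptions, not the MLBS implementation: the class name LayeredBuffer, its prefetch/get interface, and the directory-per-tier layout are hypothetical.

// A minimal sketch, not the MLBS API: a two-tier buffer container that
// decouples I/O from computation with one dedicated I/O thread.
// The class name, interface, and directory-per-tier layout are
// assumptions made for illustration; tier paths must already exist.
#include <condition_variable>
#include <fstream>
#include <iterator>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

class LayeredBuffer {
public:
    LayeredBuffer(std::string fast_tier, std::string slow_tier)
        : fast_(std::move(fast_tier)), slow_(std::move(slow_tier)),
          io_thread_([this] { serve(); }) {}

    ~LayeredBuffer() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        io_thread_.join();
    }

    // Compute side: enqueue a prefetch request and return immediately.
    void prefetch(const std::string& object) {
        { std::lock_guard<std::mutex> lk(m_); pending_.push(object); }
        cv_.notify_all();
    }

    // Compute side: block only if the object has not been staged yet.
    std::vector<char> get(const std::string& object) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return cache_.count(object) > 0; });
        return cache_[object];
    }

private:
    void serve() {  // Runs on the dedicated I/O thread.
        for (;;) {
            std::string object;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [&] { return done_ || !pending_.empty(); });
                if (pending_.empty()) return;  // done_ set, queue drained
                object = pending_.front();
                pending_.pop();
            }
            // Read from the slow tier (e.g., the parallel file system)
            // while the application keeps computing.
            std::ifstream in(slow_ + "/" + object, std::ios::binary);
            std::vector<char> data((std::istreambuf_iterator<char>(in)),
                                   std::istreambuf_iterator<char>());
            // Stage a copy on the fast tier (e.g., node-local NVMe) so a
            // future DRAM eviction can be refilled without touching the
            // slow tier again.
            std::ofstream out(fast_ + "/" + object, std::ios::binary);
            out.write(data.data(),
                      static_cast<std::streamsize>(data.size()));
            {
                std::lock_guard<std::mutex> lk(m_);
                cache_[object] = std::move(data);
            }
            cv_.notify_all();  // Wake any get() waiting on this object.
        }
    }

    std::string fast_, slow_;
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::string> pending_;
    std::unordered_map<std::string, std::vector<char>> cache_;
    bool done_ = false;
    std::thread io_thread_;
};

In a reverse-time migration loop, for example, the application would call prefetch() for the wavefield snapshot needed at step k+1 before computing on the snapshot for step k; by the time get() is called, the I/O thread has usually finished staging, so the read costs little more than a memory copy.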

[1] William W. Symes, et al. Reverse time migration with optimal checkpointing, 2007.

[2] Xian-He Sun, et al. Harmonia: An Interference-Aware Dynamic I/O Scheduler for Shared Non-volatile Burst Buffers. 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018.

[3] Hal Finkel, et al. HACC: Simulating Sky Surveys on State-of-the-Art Supercomputing Architectures. arXiv:1410.2805, 2014.

[4] Teng Wang, et al. BurstMem: A high-performance burst buffer system for scientific applications. 2014 IEEE International Conference on Big Data (Big Data), 2014.

[5] Nicholas J. Wright, et al. Architecture and Design of Cray DataWarp, 2016.

[6] Limin Xiao, et al. A New File-Specific Stripe Size Selection Method for Highly Concurrent Data Access. 2012 ACM/IEEE 13th International Conference on Grid Computing, 2012.

[7] Vivek S. Pai, et al. SSDAlloc: Hybrid SSD/RAM Memory Management Made Easy. NSDI, 2011.

[8] Yue Wang, et al. Reverse-time migration, 1999.

[9] Teng Wang, et al. An Ephemeral Burst-Buffer File System for Scientific Applications. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis, 2016.

[10] Prabhat, et al. Storage 2020: A Vision for the Future of HPC Storage, 2017.

[11] Karsten Schwan, et al. Managing Variability in the IO Performance of Petascale Storage Systems. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.

[12] Jack J. Dongarra, et al. Exascale computing and big data. Commun. ACM, 2015.

[13] Cluster File Systems, Inc. Lustre: A Scalable, High-Performance File System, 2003.

[14] Houjun Tang, et al. ARCHIE: Data Analysis Acceleration with Array Caching in Hierarchical Storage. 2018 IEEE International Conference on Big Data (Big Data), 2018.

[15] Houjun Tang, et al. UniviStor: Integrated Hierarchical and Distributed Storage for HPC. 2018 IEEE International Conference on Cluster Computing (CLUSTER), 2018.

[16] Scott Klasky, et al. Predicting Output Performance of a Petascale Supercomputer. HPDC, 2017.

[17] Kesheng Wu, et al. Data Elevator: Low-Contention Data Movement in Hierarchical Storage System. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), 2016.

[18] Jack J. Dongarra, et al. Collecting Performance Data with PAPI-C. Parallel Tools Workshop, 2009.

[19] Frank B. Schmuck, et al. GPFS: A Shared-Disk File System for Large Computing Clusters. FAST, 2002.

[20] Robert B. Ross, et al. On the role of burst buffers in leadership-class storage systems. 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), 2012.

[21] David R. Musser, et al. STL Tutorial and Reference Guide: C++ Programming with the Standard Template Library. Addison-Wesley Professional Computing Series, 1996.

[22] Jun Yang, et al. Data Management in Machine Learning: Challenges, Techniques, and Systems. SIGMOD Conference, 2017.

[23] Andrew Pavlo, et al. Write-Behind Logging. Proc. VLDB Endow., 2016.

[24] Robert B. Ross, et al. PVFS: A Parallel File System for Linux Clusters. Annual Linux Showcase & Conference, 2000.

[25] Paul Sava, et al. Overview and classification of wavefield seismic imaging methods, 2009.

[26] Arun Jagatheesan, et al. Understanding the Impact of Emerging Non-Volatile Memories on High-Performance, IO-Intensive Computing. 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, 2010.

[27] Chao Wang, et al. NVMalloc: Exposing an Aggregate SSD Store as a Memory Partition in Extreme-Scale Machines. 2012 IEEE 26th International Parallel and Distributed Processing Symposium, 2012.

[28] Arie Shoshani, et al. Scientific Data Management: Challenges, Technology, and Deployment, 2009.

[29] Xian-He Sun, et al. Hermes: a heterogeneous-aware multi-tiered distributed I/O buffering system. HPDC, 2018.