Hierarchical Read–Write Optimizations for Scientific Applications with Multi-variable Structured Datasets

Large-scale scientific applications spend a significant amount of time reading and writing data. These simulations run on supercomputers built around high-bandwidth, low-latency interconnects with complex topologies, yet few efforts fully exploit these interconnect features for I/O. MPI-IO optimizations suffer from significant network contention at large core counts, making I/O a critical bottleneck at extreme scales. We propose HieRO, which leverages the fast interconnect and performs hierarchical I/O optimizations for scientific applications with structured datasets. HieRO performs reads and writes in multiple stages using carefully chosen leader processes that invoke the MPI-IO calls. Additionally, HieRO takes the application’s domain decomposition and access patterns into account and fully utilizes the on-chip interconnect at each multicore node. We evaluate the efficacy of our optimizations with two scientific applications, WRF and S3D, whose I/O access patterns are common to a wide gamut of applications. We evaluate our approaches on two supercomputers, the Edison Cray XC30 and the Mira Blue Gene/Q, representing systems with diverse interconnects and parallel filesystems. We demonstrate that algorithmic changes can lead to significant improvements in parallel read/write performance. HieRO achieves more than 40× read-time improvement for WRF, and up to 40× read-time and 13× write-time improvements for S3D on 524,288 cores.
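To make the multi-stage, leader-based idea concrete, the following is a minimal MPI/C sketch of hierarchical write aggregation in the spirit of what the abstract describes: ranks on each node first gather their blocks to a per-node leader over the on-node interconnect, and only the leaders issue the MPI-IO call. This is not the HieRO implementation; the contiguous 1-D data layout, buffer sizes, equal ranks-per-node assumption, and the file name hiero_sketch.out are illustrative assumptions.

/*
 * Sketch of two-stage, leader-based collective writing (assumed layout,
 * not the actual HieRO code).
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Stage 1: split ranks by shared-memory node; rank 0 of each
     * node communicator acts as the leader (aggregator). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank owns a contiguous block of doubles (assumed layout). */
    const int local_n = 1 << 20;
    double *local = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++)
        local[i] = (double)world_rank;

    /* Gather all on-node blocks at the leader over the node-local fabric. */
    double *agg = NULL;
    if (node_rank == 0)
        agg = malloc((size_t)node_size * local_n * sizeof(double));
    MPI_Gather(local, local_n, MPI_DOUBLE,
               agg, local_n, MPI_DOUBLE, 0, node_comm);

    /* Stage 2: only the leaders open the file and perform the write. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (node_rank == 0) {
        int leader_rank;
        MPI_Comm_rank(leader_comm, &leader_rank);

        MPI_File fh;
        MPI_File_open(leader_comm, "hiero_sketch.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Leaders write node-contiguous chunks; offsets assume every
         * node hosts the same number of ranks. */
        MPI_Offset off = (MPI_Offset)leader_rank * node_size * local_n
                         * (MPI_Offset)sizeof(double);
        MPI_File_write_at_all(fh, off, agg, node_size * local_n,
                              MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Comm_free(&leader_comm);
        free(agg);
    }

    free(local);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Restricting the MPI-IO call to one aggregator per node is what reduces the number of clients contending for the interconnect and the parallel filesystem; the choice and placement of leaders, and the handling of multi-variable structured layouts, are where the paper's contributions lie and are not captured by this sketch.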
