Hierarchical Read–Write Optimizations for Scientific Applications with Multi-variable Structured Datasets

Large-scale scientific applications spend a significant amount of time reading and writing data. These simulations run on supercomputers built around high-bandwidth, low-latency interconnects with complex topologies, yet few efforts fully exploit these interconnect features for I/O. MPI-IO optimizations suffer from significant network contention at large core counts, making I/O a critical bottleneck at extreme scales. We propose HieRO, which leverages the fast interconnect and performs hierarchical I/O optimizations for scientific applications with structured datasets. HieRO performs reads and writes in multiple stages using carefully chosen leader processes that invoke the MPI-IO calls. Additionally, HieRO takes the application’s domain decomposition and access patterns into account and fully utilizes the on-chip interconnect at each multicore node. We evaluate the efficacy of our optimizations with two scientific applications, WRF and S3D, whose I/O access patterns are common to a wide gamut of applications. We evaluate our approaches on two supercomputers, the Edison Cray XC30 and the Mira Blue Gene/Q, representing systems with diverse interconnects and parallel filesystems. We demonstrate that algorithmic changes can lead to significant improvements in parallel read/write performance. HieRO achieves more than 40× read-time improvement for WRF, and up to 40× read-time and 13× write-time improvements for S3D on 524,288 cores.
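To make the multi-stage, leader-based idea concrete, the following is a minimal MPI/C sketch of hierarchical write aggregation in the spirit of what the abstract describes: ranks on each node first gather their blocks to a per-node leader over the on-node interconnect, and only the leaders issue the MPI-IO call. This is not the HieRO implementation; the contiguous 1-D data layout, buffer sizes, equal ranks-per-node assumption, and the file name hiero_sketch.out are illustrative assumptions.

/*
 * Sketch of two-stage, leader-based collective writing (assumed layout,
 * not the actual HieRO code).
 */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Stage 1: split ranks by shared-memory node; rank 0 of each
     * node communicator acts as the leader (aggregator). */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Each rank owns a contiguous block of doubles (assumed layout). */
    const int local_n = 1 << 20;
    double *local = malloc(local_n * sizeof(double));
    for (int i = 0; i < local_n; i++)
        local[i] = (double)world_rank;

    /* Gather all on-node blocks at the leader over the node-local fabric. */
    double *agg = NULL;
    if (node_rank == 0)
        agg = malloc((size_t)node_size * local_n * sizeof(double));
    MPI_Gather(local, local_n, MPI_DOUBLE,
               agg, local_n, MPI_DOUBLE, 0, node_comm);

    /* Stage 2: only the leaders open the file and perform the write. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);
    if (node_rank == 0) {
        int leader_rank;
        MPI_Comm_rank(leader_comm, &leader_rank);

        MPI_File fh;
        MPI_File_open(leader_comm, "hiero_sketch.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        /* Leaders write node-contiguous chunks; offsets assume every
         * node hosts the same number of ranks. */
        MPI_Offset off = (MPI_Offset)leader_rank * node_size * local_n
                         * (MPI_Offset)sizeof(double);
        MPI_File_write_at_all(fh, off, agg, node_size * local_n,
                              MPI_DOUBLE, MPI_STATUS_IGNORE);
        MPI_File_close(&fh);
        MPI_Comm_free(&leader_comm);
        free(agg);
    }

    free(local);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Restricting the MPI-IO call to one aggregator per node is what reduces the number of clients contending for the interconnect and the parallel filesystem; the choice and placement of leaders, and the handling of multi-variable structured layouts, are where the paper's contributions lie and are not captured by this sketch.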
