TAPIOCA: An I/O Library for Optimized Topology-Aware Data Aggregation on Large-Scale Supercomputers

Reading and writing data efficiently from the storage system is necessary for most scientific simulations to achieve good performance at scale. Many software solutions have been developed to alleviate the I/O bottleneck. One well-known strategy, in the context of collective I/O operations, is the two-phase I/O scheme: a subset of processes is selected to aggregate contiguous pieces of data before performing the reads or writes. In this paper, we present TAPIOCA, an MPI-based library implementing an efficient topology-aware two-phase I/O algorithm. We show how TAPIOCA takes advantage of double buffering and one-sided communication to reduce the idle time during data aggregation as much as possible. We also introduce a cost model that drives a topology-aware aggregator placement and optimizes data movement. We validate our approach at large scale on two leadership-class supercomputers, Mira (IBM BG/Q) and Theta (Cray XC40), using a micro-benchmark and the I/O kernel of a large-scale simulation. On both architectures, TAPIOCA substantially improves I/O performance compared with the default MPI I/O implementation. On BG/Q with GPFS, for instance, our algorithm improves performance by a factor of twelve, while on the Cray XC40 system with a Lustre file system we achieve a fourfold improvement.
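
The two-phase scheme summarized above is also what standard MPI-IO collective buffering implements. As a point of reference only (this is plain MPI-IO with ROMIO hints, not TAPIOCA's interface, which the abstract does not detail), the minimal sketch below shows a collective write in which the number of aggregators is steered through hints; the file name, buffer size, and hint values are illustrative assumptions.

```c
/* Minimal sketch of a collective (two-phase) MPI-IO write.
 * Standard MPI-IO with ROMIO collective-buffering hints, shown for
 * illustration only; it is not TAPIOCA's API. File name, element
 * count, and hint values are assumptions. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 20;   /* 1 Mi doubles per rank (illustrative) */
    double *buf = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++)
        buf[i] = (double)rank;

    /* Hints steering the two-phase (collective buffering) path in ROMIO:
     * enable collective buffering for writes and request 16 aggregators. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");
    MPI_Info_set(info, "cb_nodes", "16");

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes a contiguous block at its own offset; the MPI-IO
     * layer gathers these blocks on the selected aggregators before
     * issuing large contiguous writes to the file system. */
    MPI_Offset offset = (MPI_Offset)rank * count * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(buf);
    MPI_Finalize();
    return 0;
}
```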

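The abstract mentions a cost model guiding aggregator placement but does not state it. As a purely illustrative sketch (an assumption, not the paper's exact formulation), topology-aware placements of this kind are often driven by a latency/bandwidth hop-count model of data movement:

```latex
% Illustrative latency/bandwidth cost model (an assumption, not the paper's formulation).
% Moving s bytes from node i to node j across d(i,j) network hops, with per-hop
% latency l and link bandwidth B, costs roughly C(i,j,s). An aggregator A serving
% a group of producers P and flushing to I/O node IO is chosen to minimize the
% combined gather and flush cost.
\begin{align}
  C(i, j, s) &= l \cdot d(i, j) + \frac{s}{B} \\
  A^{*} &= \arg\min_{A} \left[ \sum_{p \in P} C(p, A, s_p)
          + C\Big(A, \mathrm{IO}, \textstyle\sum_{p \in P} s_p\Big) \right]
\end{align}
```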