Improving MPI Collective I/O Performance With Intra-node Request Aggregation

Two-phase I/O is a well-known strategy for implementing collective MPI-IO functions. It redistributes I/O requests among the calling processes into a form that minimizes the file access cost. As modern parallel computers continue to grow into the exascale era, the communication cost of this request redistribution can quickly dominate collective I/O performance. This effect is evident in parallel jobs that run on many compute nodes with a large number of MPI processes per node. To reduce the communication cost, we present a new design for collective I/O that adds an extra communication layer to aggregate requests among processes within the same compute node. This approach can significantly reduce inter-node communication congestion when redistributing the I/O requests. We evaluate the new design and compare it with the original two-phase I/O on a Cray XC40 parallel computer with Intel KNL processors. Using I/O patterns from two large-scale production applications and an I/O benchmark, we show a performance improvement of up to 29 times when running 16384 MPI processes on 256 compute nodes.
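To make the idea of an intra-node aggregation layer concrete, the following is a minimal sketch, not the authors' implementation: processes sharing a compute node are grouped with MPI_Comm_split_type, one node-level aggregator gathers the node's write requests, and only the aggregators contribute data to the subsequent collective write. The function name node_aggregated_write is hypothetical, and the sketch assumes the file was opened on MPI_COMM_WORLD and that each node's gathered requests form one contiguous file region starting at the aggregator's own offset.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical helper: aggregate the write requests of all processes on a
 * node at the node's rank-0 process before the inter-node collective write. */
void node_aggregated_write(MPI_File fh, MPI_Offset my_off,
                           const char *buf, int len)
{
    MPI_Comm node_comm;
    int node_rank, node_size;

    /* Group the ranks that share this compute node. */
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);

    /* Intra-node phase: the node aggregator (node_rank 0) gathers request
     * lengths, offsets, and data from its node-local peers. */
    int *lens = NULL;
    MPI_Offset *offs = NULL;
    if (node_rank == 0) {
        lens = malloc(node_size * sizeof(int));
        offs = malloc(node_size * sizeof(MPI_Offset));
    }
    MPI_Gather(&len, 1, MPI_INT, lens, 1, MPI_INT, 0, node_comm);
    MPI_Gather(&my_off, 1, MPI_OFFSET, offs, 1, MPI_OFFSET, 0, node_comm);

    char *agg_buf = NULL;
    int *displs = NULL, total = 0;
    if (node_rank == 0) {
        displs = malloc(node_size * sizeof(int));
        for (int i = 0; i < node_size; i++) {
            displs[i] = total;
            total += lens[i];
        }
        agg_buf = malloc(total);
    }
    MPI_Gatherv(buf, len, MPI_BYTE, agg_buf, lens, displs, MPI_BYTE,
                0, node_comm);

    /* Inter-node phase: the file is assumed to be opened on MPI_COMM_WORLD,
     * so every rank enters the collective write, but only node aggregators
     * contribute data; the others issue zero-length requests.  The request
     * redistribution inside the collective therefore involves only one
     * active buffer per node.  (For simplicity, the node's data is assumed
     * to be contiguous in the file starting at offs[0].) */
    if (node_rank == 0) {
        MPI_File_write_at_all(fh, offs[0], agg_buf, total, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        free(lens); free(offs); free(displs); free(agg_buf);
    } else {
        MPI_File_write_at_all(fh, 0, NULL, 0, MPI_BYTE, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&node_comm);
}
```

The key design point this sketch illustrates is that the intra-node gathers use only shared-memory communication, so the expensive inter-node exchange during request redistribution is limited to one aggregated buffer per node instead of one per process.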
