Topology-Aware Data Aggregation for High Performance Collective MPI-IO on a Multi-core Cluster System

Parallel I/O interfaces such as MPI-IO are a key means of improving I/O performance in MPI-based parallel computing. ROMIO, a widely used MPI-IO implementation, improves collective I/O performance through an optimization called two-phase I/O, in which file I/O is delegated to a subset of (or all) MPI processes known as aggregators. Multi-core compute nodes make it possible to increase computing power by deploying multiple MPI processes per node, but such deployments suffer poor I/O performance because ROMIO's aggregator layout is topology-unaware. In our previous work, an aggregator layout optimized for striped accesses on a Lustre file system improved I/O performance; however, its communication load was unbalanced because it ignored the MPI rank layout across compute nodes, which made data aggregation ineffective. To minimize data aggregation time and further improve I/O performance, we introduce a topology-aware data aggregation scheme that takes the MPI rank layout across compute nodes into account. The proposed scheme arranges the order in which aggregators collect data so as to mitigate network contention. The optimization achieved up to 67% improvement in I/O performance over the original ROMIO in HPIO benchmark runs using 768 processes on 64 compute nodes of the TSUBAME2.5 supercomputer at the Tokyo Institute of Technology. Even when the number of aggregators was reduced to one half or one third of the total number of processes, the optimized scheme maintained I/O performance comparable to the maximum.
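The listing below is a minimal sketch, in MPI/C, of the two ideas summarized above; it is not the authors' actual ROMIO modification. It assumes one aggregator per compute node and that, conceptually, each aggregator would pull one data block from every node per round. Ranks sharing a node are discovered with MPI_Comm_split_type, the node-local rank 0 is chosen as the aggregator (spreading aggregators evenly across nodes), and each aggregator's pull order is rotated by its own node index so that no two aggregators target the same source node in the same round. The two-phase I/O data movement itself is omitted; the program only derives and prints the contention-aware collection schedule.

/* Hedged sketch: topology-aware aggregator layout and a staggered,
 * contention-aware collection schedule.  Not the ROMIO patch described
 * in the paper; node numbering and the one-aggregator-per-node rule are
 * illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Group ranks that share a compute node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, wrank,
                        MPI_INFO_NULL, &node_comm);
    int nrank;
    MPI_Comm_rank(node_comm, &nrank);

    /* Topology-aware aggregator layout: node-local rank 0 on every node. */
    int is_agg = (nrank == 0);

    /* Number the nodes by ranking their aggregators in a separate comm. */
    MPI_Comm agg_comm;
    MPI_Comm_split(MPI_COMM_WORLD, is_agg ? 0 : MPI_UNDEFINED, wrank,
                   &agg_comm);

    if (is_agg) {
        int node_id, num_nodes;
        MPI_Comm_rank(agg_comm, &node_id);
        MPI_Comm_size(agg_comm, &num_nodes);

        /* Staggered collection order: shift the source node by this
         * aggregator's own node index, so that in any given round the
         * aggregators pull from pairwise-distinct nodes. */
        for (int round = 0; round < num_nodes; round++) {
            int src_node = (node_id + round) % num_nodes;
            printf("aggregator on node %d, round %d: pull from node %d\n",
                   node_id, round, src_node);
        }
        MPI_Comm_free(&agg_comm);
    }

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with, for example, 12 processes per node across 64 nodes (as in the TSUBAME2.5 runs), each of the 64 aggregators prints a 64-round schedule in which the source nodes accessed in any round form a permutation of the nodes, so no node receives two concurrent pull requests. The rotation is a standard shift-based schedule used here only to illustrate the contention-mitigation idea; the paper's scheme additionally accounts for the concrete MPI rank layout.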
