InterferenceRemoval: removing interference of disk access for MPI programs through data replication

As the number of I/O-intensive MPI programs grows, many efforts have been made to improve their I/O performance, on both the software and the architecture sides. On the software side, researchers optimize processes' access patterns, either individually (e.g., by issuing large, sequential requests in each process) or collectively (e.g., by using collective I/O). On the architecture side, files are striped over multiple I/O nodes for high aggregate I/O throughput. However, a key weakness, the access interference at each I/O node, remains unaddressed by these efforts. When requests from multiple processes are served simultaneously by multiple I/O nodes, each I/O node has to serve requests from different processes concurrently. An I/O node usually stores its data on hard disks, and different processes access different regions of a data set. Under a burst of requests from multiple processes, requests from different processes to the same disk compete for its single disk head, and disk efficiency can be significantly reduced by frequent disk head seeks. In this paper, we propose a scheme, InterferenceRemoval, to eliminate I/O interference by taking advantage of optimized access patterns and the potentially high throughput provided by multiple I/O nodes. It identifies the segments of files that could be involved in interfering accesses and replicates them to their respectively designated I/O nodes. When interference is detected at an I/O node, some I/O requests can be redirected to the replicas on other I/O nodes, so that each I/O node serves requests from only one or a limited number of processes. InterferenceRemoval has been implemented in the MPI library, for high portability, on top of the Lustre parallel file system. Our experiments with representative benchmarks, such as NPB BTIO and mpi-tile-io, show that it can significantly improve the I/O performance of MPI programs. For example, the I/O throughput of mpi-tile-io is increased by 105% compared to the case without collective I/O, and by 23% compared to the case with collective I/O.
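The redirection idea described above can be illustrated with a minimal sketch: segments prone to interfering access are replicated, each replica is assigned to a designated I/O node, and when interference is detected a request is steered to the replica designated for its process, so that each disk sees a stream from only one (or a few) processes. This is not the authors' implementation; the segment descriptor, the round-robin rank-to-replica mapping, and all names below are illustrative assumptions.

```c
/*
 * Hedged sketch of replica-based redirection (assumed data structures and
 * mapping policy; not the paper's actual code).
 */
#include <stdio.h>

#define NUM_IO_NODES 4

/* Hypothetical descriptor for one replicated file segment. */
typedef struct {
    long offset;                       /* segment start offset in the file  */
    int  replica_node[NUM_IO_NODES];   /* I/O node holding each replica     */
    int  num_replicas;
} segment_t;

/*
 * Choose the I/O node that should serve a request from process `rank`.
 * Without detected interference the request goes to the node holding the
 * primary copy (index 0 here, purely for illustration).  When interference
 * is detected, each process is pinned to one replica so that the per-disk
 * access stream stays largely free of competing seeks.
 */
static int choose_io_node(const segment_t *seg, int rank, int interference)
{
    if (!interference || seg->num_replicas == 1)
        return seg->replica_node[0];
    return seg->replica_node[rank % seg->num_replicas];
}

int main(void)
{
    segment_t seg = { 0, {0, 1, 2, 3}, NUM_IO_NODES };

    /* Eight processes issuing interfering requests to the same segment. */
    for (int rank = 0; rank < 8; rank++)
        printf("rank %d -> I/O node %d\n",
               rank, choose_io_node(&seg, rank, /*interference=*/1));
    return 0;
}
```

In this sketch, two processes may still share a replica when there are more processes than replicas, which matches the paper's weaker goal of limiting each I/O node to a small number of processes rather than strictly one.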
