PVFS over InfiniBand: design and performance evaluation

I/O is quickly emerging as the main bottleneck limiting performance in modern-day clusters, and the need for scalable parallel I/O and file systems is becoming increasingly urgent. We examine the feasibility of leveraging InfiniBand technology to improve the I/O performance and scalability of cluster file systems, using the Parallel Virtual File System (PVFS) as the basis for this exploration. We design and implement a version of PVFS on InfiniBand that takes advantage of InfiniBand features and resolves a number of challenging issues. The design comprises a transport layer customized for PVFS, which trades transparency and generality for performance, and a buffer management scheme that provides flow control, dynamic and fair buffer sharing, and efficient memory registration and deregistration. Compared to a PVFS implementation over standard TCP/IP on the same InfiniBand network, our implementation delivers three times the bandwidth when workloads are not disk-bound and a 40% bandwidth improvement when they are disk-bound. Client CPU utilization drops from 91% over TCP/IP to 1.5%. To the best of our knowledge, this is the first design, implementation, and evaluation of PVFS over InfiniBand. The results demonstrate how to design high-performance parallel file systems on next-generation clusters with InfiniBand.
