Optimization of Parallel I/O for Cannon's Algorithm Based on Lustre

Matrix multiplication is one of the most important operations in linear algebra and is widely used in many fields of science and engineering. Cannon's algorithm is a classical distributed algorithm for matrix multiplication on two-dimensional meshes. Its I/O is generally handled with MPI-IO. However, it has been well documented that MPI-IO performs poorly in a Lustre file system environment. As the scale of the matrix multiplication increases, this problem tends to become serious and turns into a key factor limiting the performance of the program. To improve the collective I/O performance of the Cannon's program, we propose a new aggregation pattern (the stripe-continuous aggregation pattern), which fully considers the striping mechanism and lock protocol of the Lustre file system. Theoretical analysis and experimental results show that, compared with other patterns, this pattern makes fuller use of the capacity of the Lustre file system and effectively improves the I/O performance of the Cannon's program.
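For readers unfamiliar with the computation whose I/O is being optimized, the following is a minimal single-process sketch of the textbook form of Cannon's algorithm on a q x q mesh. It is not the authors' program: it involves no MPI, no file I/O, and no Lustre-specific behavior, and every function name in it (`cannon`, `split_blocks`, `matmul`, `add_into`) is a hypothetical helper introduced only for illustration. The "processes" are simulated as entries of a q x q Python list, and the shifts that a real implementation would perform with message passing are simulated by re-indexing.

```python
def matmul(X, Y):
    """Plain product of two small dense blocks stored as lists of lists."""
    inner, cols = len(Y), len(Y[0])
    return [[sum(row[k] * Y[k][j] for k in range(inner)) for j in range(cols)]
            for row in X]


def add_into(C, D):
    """Accumulate block D into block C in place."""
    for i in range(len(C)):
        for j in range(len(C[0])):
            C[i][j] += D[i][j]


def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    b = len(M) // q
    return [[[row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]
             for j in range(q)] for i in range(q)]


def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q mesh (illustrative only)."""
    n = len(A)
    assert n % q == 0, "matrix order must be divisible by the mesh size"
    b = n // q
    a = split_blocks(A, q)
    bb = split_blocks(B, q)

    # Initial skew: row i of A shifts left by i, column j of B shifts up by j,
    # so mesh position (i, j) holds A[i][(i+j) % q] and B[(i+j) % q][j].
    a = [[a[i][(j + i) % q] for j in range(q)] for i in range(q)]
    bb = [[bb[(i + j) % q][j] for j in range(q)] for i in range(q)]

    # One zero-initialized result block per mesh position.
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Local multiply-accumulate at every mesh position.
        for i in range(q):
            for j in range(q):
                add_into(c[i][j], matmul(a[i][j], bb[i][j]))
        # Shift A blocks one step left and B blocks one step up (with wraparound).
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Reassemble the result blocks into a single n x n matrix.
    C = [[0] * n for _ in range(n)]
    for i in range(q):
        for j in range(q):
            for r in range(b):
                C[i * b + r][j * b:(j + 1) * b] = c[i][j][r]
    return C
```

In a parallel implementation each mesh position would be a separate process that first has to read its blocks of the input matrices from a shared file and later write its block of the result, which is the I/O phase that the aggregation pattern described above targets.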
