ENHANCING THE MATRIX TRANSPOSE OPERATION USING INTEL AVX INSTRUCTION SET EXTENSION

General-purpose microprocessors are augmented with short-vector instruction extensions so that a single operation can process more than one data element at a time. This type of parallelism is known as data-parallel processing. Many scientific, engineering, and signal processing applications can be formulated as matrix operations. Accelerating these kernel operations on microprocessors, which are the building blocks of large high-performance computing systems, should therefore boost the performance of such applications. In this paper, we consider the acceleration of the matrix transpose operation using the 256-bit Intel Advanced Vector Extensions (AVX) instructions. We present a novel vector-based matrix transpose algorithm and its optimized implementation using AVX instructions. Experimental results on an Intel Core i7 processor demonstrate a 2.83x speedup over the standard sequential implementation and a maximum speedup of 1.53x over the GCC library implementation. When the transpose is combined with matrix addition to compute the matrix update, B + A^T, where A and B are

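For orientation, the two scalar operations that the abstract uses as its baseline can be sketched as below. This is a minimal illustration and not the paper's AVX algorithm. The function names `transpose_seq` and `update_add_transpose`, the `float` element type, the row-major layout, and the separate output matrix `c` are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical baseline: sequential out-of-place transpose.
 * a is rows x cols, t is cols x rows, both stored row-major.
 * The element type (float) is an assumption, not taken from the paper. */
void transpose_seq(const float *a, float *t, size_t rows, size_t cols)
{
    assert(a != t); /* in-place operation is not supported by this sketch */
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            /* element (i, j) of a becomes element (j, i) of t */
            t[j * rows + i] = a[i * cols + j];
        }
    }
}

/* Hypothetical baseline: matrix update c = b + a^T for n x n
 * row-major matrices, fusing the transpose into the addition so that
 * no intermediate transposed copy of a is materialized. */
void update_add_transpose(const float *a, const float *b, float *c, size_t n)
{
    assert(a != c); /* writing c must not overwrite unread elements of a */
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            c[i * n + j] = b[i * n + j] + a[j * n + i];
        }
    }
}
```

In both loops the reads of `a` in the fused update, and the writes to `t` in the transpose, are strided by a full row. That strided access pattern is what a vectorized transpose has to reorganize with in-register shuffles.
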