Paper information - ENHANCING THE MATRIX TRANSPOSE OPERATION USING INTEL AVX INSTRUCTION SET EXTENSION

General-purpose microprocessors are augmented with short-vector instruction extensions in order to process more than one data element simultaneously using the same operation. This type of parallelism is known as data-parallel processing. Many scientific, engineering, and signal processing applications can be formulated as matrix operations. Therefore, accelerating these kernel operations on microprocessors, which are the building blocks of large high-performance computing systems, will boost the performance of these applications. In this paper, we consider the acceleration of the matrix transpose operation using the 256-bit Intel Advanced Vector Extensions (AVX) instructions. We present a novel vector-based matrix transpose algorithm and its optimized implementation using AVX instructions. The experimental results on an Intel Core i7 processor demonstrate a 2.83x speedup over the standard sequential implementation, and a maximum 1.53x speedup over the GCC library implementation. When the transpose is combined with matrix addition to compute the matrix update, B + A^T, where A and B are
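
For orientation, the two operations named in the abstract can be sketched in plain scalar C. This is only an illustrative baseline of the kind a "standard sequential implementation" might resemble, not the paper's AVX algorithm. The function names `transpose_seq` and `add_transpose_seq`, the row-major layout, the `float` element type, and the square shape assumed for the update are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar baseline (hypothetical name): out = A^T.
 * a   : rows x cols matrix, row-major (assumed layout)
 * out : cols x rows matrix, row-major; must not alias a */
void transpose_seq(const float *a, float *out, size_t rows, size_t cols)
{
    assert(a != NULL && out != NULL);
    for (size_t i = 0; i < rows; ++i) {
        for (size_t j = 0; j < cols; ++j) {
            /* element (i, j) of A becomes element (j, i) of A^T */
            out[j * rows + i] = a[i * cols + j];
        }
    }
}

/* Matrix update B = B + A^T, computed in place on b.
 * Both matrices are assumed to be square n x n here; the
 * abstract's description of their shape is cut off. */
void add_transpose_seq(const float *a, float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* read A column-wise, write B row-wise */
            b[i * n + j] += a[j * n + i];
        }
    }
}
```

In both loops one of the two matrices is accessed with a stride of a full row, which is the access pattern a vectorized transpose aims to improve on.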

Ahmed Zekri | A. Zekri
