We describe a decomposition for in-place matrix transposition, with applications to Array of Structures memory accesses on SIMD processors. Traditional approaches to in-place matrix transposition involve cycle following, which is difficult to parallelize and which, on an m × n matrix, requires O(mn log mn) work when restricted to less than O(mn) auxiliary space. Our decomposition allows the rows and columns to be operated on independently during in-place transposition, reducing the work complexity to O(mn), given O(max(m, n)) auxiliary space. This decomposition leads to an efficient and naturally parallel algorithm: we have measured a median throughput of 19.5 GB/s on an NVIDIA Tesla K20c processor. An implementation specialized for the skinny matrices that arise when converting Arrays of Structures to Structures of Arrays yields a median throughput of 34.3 GB/s and a maximum throughput of 51 GB/s.
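For context, the cycle-following baseline mentioned above can be sketched as follows. This is a minimal Python sketch of the traditional approach, not the decomposition described in this paper; for clarity it uses an O(mn) visited bitmap, which the space-restricted variants the abstract refers to avoid at the cost of the extra work noted above.

```python
def transpose_in_place(a, m, n):
    """In-place transpose of a flat, row-major m x n matrix via cycle following.

    After the call, `a` holds the n x m transpose in row-major order.
    The element at flat index k moves to (k * m) mod (m * n - 1),
    with indices 0 and m*n - 1 as fixed points.
    """
    size = m * n
    if size <= 1:
        return
    visited = [False] * size
    for start in range(size - 1):          # index size - 1 is a fixed point
        if visited[start]:
            continue
        cur, val = start, a[start]
        while not visited[cur]:
            visited[cur] = True
            dst = (cur * m) % (size - 1)   # (i, j) -> (j, i) on flat indices
            a[dst], val = val, a[dst]      # place val, pick up displaced element
            cur = dst
```

For example, transposing the 2 × 3 matrix stored as `[1, 2, 3, 4, 5, 6]` yields `[1, 4, 2, 5, 3, 6]`. The cycles of this permutation depend irregularly on m and n, which is what makes the approach hard to parallelize and to run in limited auxiliary space.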
Because of its simple structure, the algorithm is particularly well suited to implementation with SIMD instructions for transposing the small arrays that arise when SIMD processors load from or store to Arrays of Structures. Using the algorithm to perform these accesses cooperatively, we measure 180 GB/s of throughput on the K20c, up to 45 times faster than compiler-generated Array of Structures accesses.
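The connection between AoS accesses and skinny-matrix transposition can be made concrete: an array of N structs with k fields, viewed as a flat buffer, is an N × k row-major matrix, and its k × N transpose is exactly the SoA layout. A minimal out-of-place Python illustration, with a hypothetical three-field struct:

```python
# Hypothetical struct with fields (x, y, z); four structs stored AoS.
aos = ["x0", "y0", "z0",
       "x1", "y1", "z1",
       "x2", "y2", "z2",
       "x3", "y3", "z3"]
n_structs, n_fields = 4, 3

# Transposing the 4 x 3 row-major matrix gives the 3 x 4 SoA layout:
# all x's, then all y's, then all z's, each contiguous.
soa = [aos[i * n_fields + f] for f in range(n_fields) for i in range(n_structs)]
# soa == ["x0","x1","x2","x3", "y0","y1","y2","y3", "z0","z1","z2","z3"]
```

In the SIMD setting described above, each lane loads one struct and the lanes then cooperatively perform this small transpose, so that every memory transaction touches contiguous addresses.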
In this paper, we explain the algorithm, prove its correctness and complexity, and explain how it can be instantiated efficiently for solving various transpose problems on both CPUs and GPUs.