Practical realizations of 3D forward/inverse separable discrete transforms, such as the Fourier transform and the cosine/sine transforms, are frequently the principal limiters that prevent many practical applications from scaling to a large number of processors. Specifically, existing approaches, which are based primarily on 1D or 2D data decompositions, prevent the 3D transforms from effectively scaling to the maximum possible (or available) number of computer nodes. Recently, a novel, highly scalable approach to realizing forward/inverse 3D transforms has been proposed. It is based on a 3D decomposition of the data and is geared towards a torus network of computer nodes. The proposed algorithms require compute-and-roll time-steps, where each step consists of the execution of multiple GEMM operations and the concurrent movement of cubical data blocks between nearest-neighbor nodes (directly using the logical arrangement of the nodes within the torus). The proposed 3D orbital algorithms gracefully avoid the otherwise required 3D data transposition. The aim of this paper is to present a preliminary experimental performance study of the proposed implementation on two different high-performance computer architectures.
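To make the compute-and-roll idea concrete, the following single-process sketch may help. It is not the proposed 3D orbital algorithm: it only illustrates two underlying facts, namely that a separable 3D transform can be written as three matrix products (one per axis), and that one such product can be accumulated on a ring of nodes by alternating a local block multiply with a nearest-neighbor roll of data blocks, so that no global transposition is needed. All function names (`dft_matrix`, `mode_product`, `ring_mode_product_axis0`, and so on) are hypothetical, the "nodes" are emulated with Python lists, and the ring is one-dimensional rather than a 3D torus.

```python
import cmath

def dft_matrix(n):
    """n x n forward DFT coefficients, c[k][j] = exp(-2*pi*i*j*k/n)."""
    return [[cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)]
            for k in range(n)]

def zeros3(n1, n2, n3):
    return [[[0j] * n3 for _ in range(n2)] for _ in range(n1)]

def mode_product(x, c, axis):
    """Multiply the 3D array x (nested lists) by the square matrix c along one axis."""
    dims = [len(x), len(x[0]), len(x[0][0])]
    out = zeros3(*dims)
    for i in range(dims[0]):
        for j in range(dims[1]):
            for k in range(dims[2]):
                idx = [i, j, k]
                acc = 0j
                for m in range(dims[axis]):
                    src = list(idx)
                    src[axis] = m
                    acc += c[idx[axis]][m] * x[src[0]][src[1]][src[2]]
                out[i][j][k] = acc
    return out

def dft3_direct(x):
    """Reference 3D DFT evaluated straight from the triple-sum definition."""
    n1, n2, n3 = len(x), len(x[0]), len(x[0][0])
    out = zeros3(n1, n2, n3)
    for k1 in range(n1):
        for k2 in range(n2):
            for k3 in range(n3):
                acc = 0j
                for j1 in range(n1):
                    for j2 in range(n2):
                        for j3 in range(n3):
                            phase = j1 * k1 / n1 + j2 * k2 / n2 + j3 * k3 / n3
                            acc += x[j1][j2][j3] * cmath.exp(-2j * cmath.pi * phase)
                out[k1][k2][k3] = acc
    return out

def ring_mode_product_axis0(x, c, p):
    """Axis-0 product on an emulated ring of p nodes (assumption: p divides len(x)).

    Node r starts with slab r of x and always accumulates output slab r.
    Each step does a local block multiply, then rolls the slabs by one node.
    """
    n, n2, n3 = len(x), len(x[0]), len(x[0][0])
    if n % p != 0:
        raise ValueError("p must divide the axis-0 length")
    b = n // p
    slabs = [x[r * b:(r + 1) * b] for r in range(p)]  # slab held by node r
    owner = list(range(p))                            # original index of that slab
    out = zeros3(n, n2, n3)
    for _ in range(p):
        for r in range(p):                            # "compute" phase on every node
            s = owner[r]
            for i in range(b):
                for m in range(b):
                    coeff = c[r * b + i][s * b + m]
                    for j in range(n2):
                        for k in range(n3):
                            out[r * b + i][j][k] += coeff * slabs[r][m][j][k]
        slabs = slabs[1:] + slabs[:1]                 # "roll" phase: pass to neighbor
        owner = owner[1:] + owner[:1]
    return out

def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two 3D nested lists."""
    return max(abs(u - v) for pa, pb in zip(a, b)
               for ra, rb in zip(pa, pb) for u, v in zip(ra, rb))
```

Applying `mode_product` along axes 0, 1, and 2 in turn with the same `dft_matrix(n)` should reproduce `dft3_direct` on an n x n x n input up to rounding error, and `ring_mode_product_axis0(x, c, p)` should agree with `mode_product(x, c, 0)`. In a real distributed setting the inner block multiply would be a GEMM call and the roll a nearest-neighbor exchange; both are only emulated here.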