Large-Scale Discrete Fourier Transform on TPUs

In this work, we present two parallel algorithms for the large-scale discrete Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. Each algorithm corresponds to one DFT formulation: the first, denoted KDFT, is based on the Kronecker product; the second, denoted FFT, is based on the Cooley-Tukey algorithm with phase adjustment. Both formulations take full advantage of the TPU's strength in matrix multiplication. The KDFT formulation also accepts nonuniform inputs directly, without an additional step. Both parallel algorithms apply the same data decomposition to the input. With this decomposition, the dense matrix multiplications in KDFT and FFT stay local to individual TPU cores and can be performed completely in parallel. Communication among TPU cores uses the one-shuffle scheme in both algorithms: data is sent and received simultaneously between two neighboring cores and along the same direction on the interconnect network. The one-shuffle scheme is designed for the interconnect topology of TPU clusters and minimizes the time spent on communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow. On a full TPU Pod, a three-dimensional complex DFT of dimension $8192 \times 8192 \times 8192$ takes 12.66 seconds with KDFT and 8.3 seconds with FFT. A scaling analysis demonstrates the high parallel efficiency of the two DFT implementations on TPUs.
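To make the two formulations concrete, the following is a minimal single-device sketch in NumPy, not the TensorFlow implementation described above. It assumes uniform sampling and reads "phase adjustment" as the usual Cooley-Tukey twiddle-factor multiplication. The function names `dft_matrix`, `kdft2`, and `fft_matmul` are hypothetical and chosen only for illustration.

```python
import numpy as np


def dft_matrix(n):
    """Dense n-by-n DFT matrix with entries exp(-2*pi*i*j*k/n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)


def kdft2(x):
    """2-D DFT written as a single Kronecker-product matrix-vector product.

    Illustrative only: forming the full Kronecker product costs O((m*n)^2)
    memory, so a practical code would apply the two factors separately.
    """
    m, n = x.shape
    big = np.kron(dft_matrix(m), dft_matrix(n))
    # With row-major flattening, (F_m kron F_n) vec(x) = vec(F_m x F_n^T).
    return (big @ x.ravel()).reshape(m, n)


def fft_matmul(x, n1, n2):
    """1-D DFT of length n1*n2 via one Cooley-Tukey split.

    Two dense matrix multiplications are separated by an elementwise
    phase (twiddle) multiplication.
    """
    n = n1 * n2
    a = x.reshape(n1, n2)                      # a[j1, j2] = x[n2*j1 + j2]
    b = dft_matrix(n1) @ a                     # length-n1 DFTs along axis 0
    twiddle = np.exp(-2j * np.pi * np.outer(np.arange(n1), np.arange(n2)) / n)
    d = (b * twiddle) @ dft_matrix(n2)         # length-n2 DFTs along axis 1
    return d.T.ravel()                         # output index k = k1 + n1*k2


rng = np.random.default_rng(0)
x2 = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
x1 = rng.standard_normal(32) + 1j * rng.standard_normal(32)
print(np.allclose(kdft2(x2), np.fft.fft2(x2)))
print(np.allclose(fft_matmul(x1, 4, 8), np.fft.fft(x1)))
```

Both functions reduce the transform to dense matrix multiplications plus, in the second case, an elementwise multiplication. These are the kinds of operations the abstract identifies as the TPU's strength.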
