FFTs with Near-Optimal Memory Access Through Block Data Layouts: Algorithm, Architecture and Design Automation

Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data transfers, and derive memory access pattern efficient algorithms using custom block data layouts. These algorithms need to be carefully mapped to the targeted platform’s architecture, particularly the memory subsystem, to fully utilize performance and energy efficiency potentials. Using the Kronecker product formalism, we integrate our optimizations into Spiral framework and evaluate a family of DRAM-optimized FFT algorithms and their hardware implementation design space via automated techniques. In our evaluations, we demonstrate DRAM-optimized accelerator designs over a large tradeoff space given various problem (single/double precision 1D, 2D and 3D FFTs) and hardware platform (off-chip DRAM, 3D-stacked DRAM, ASIC, FPGA, etc.) parameters. We show that Spiral generated pareto optimal designs can achieve close to theoretical peak performance of the targeted platform offering 6x and 6.5x system performance and power efficiency improvements respectively over conventional row-column FFT algorithms.

[1]  Franz Franchetti,et al.  Design Automation Framework for Application-Specific Logic-in-Memory Blocks , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[2]  Franz Franchetti,et al.  Understanding the design space of DRAM-optimized hardware FFT accelerators , 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.

[3]  Naga K. Govindaraju,et al.  High performance discrete Fourier transforms on graphics processors , 2008, 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis.

[4]  Franz Franchetti,et al.  SPIRAL: Code Generation for DSP Transforms , 2005, Proceedings of the IEEE.

[5]  Steven G. Johnson,et al.  The Design and Implementation of FFTW3 , 2005, Proceedings of the IEEE.

[6]  Robert S. Germain,et al.  Scalable framework for 3D FFTs on the Blue Gene/L supercomputer: Implementation and early performance measurements , 2005, IBM J. Res. Dev..

[7]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[8]  Narayanan Vijaykrishnan,et al.  Multidimensional DFT IP Generator for FPGA Platforms , 2011, IEEE Transactions on Circuits and Systems I: Regular Papers.

[9]  J. Thomas Pawlowski,et al.  Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).

[10]  Franz Franchetti,et al.  HAMLeT: Hardware accelerated memory layout transform within 3D-stacked DRAM , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[11]  James C. Hoe,et al.  Automatic generation of streaming datapaths for arbitrary fixed permutations , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[12]  Bruce Jacob,et al.  DRAMSim2: A Cycle Accurate Memory System Simulator , 2011, IEEE Computer Architecture Letters.

[13]  Franz Franchetti,et al.  Discrete fourier transform on multicore , 2009, IEEE Signal Processing Magazine.

[14]  Franz Franchetti,et al.  A 3D-stacked logic-in-memory accelerator for application-specific data intensive computing , 2013, 2013 IEEE International 3D Systems Integration Conference (3DIC).

[15]  Franz Franchetti,et al.  FFT (Fast Fourier Transform) , 2011, Encyclopedia of Parallel Computing.

[16]  David Padua,et al.  Encyclopedia of Parallel Computing , 2011 .

[17]  C. Loan Computational Frameworks for the Fast Fourier Transform , 1992 .

[18]  James C. Hoe,et al.  Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[19]  David H. Bailey,et al.  FFTs in external or hierarchical memory , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[20]  Luca Benini,et al.  Exploration and Optimization of 3-D Integrated DRAM Subsystems , 2013, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[21]  J. Tukey,et al.  An algorithm for the machine calculation of complex Fourier series , 1965 .

[22]  Andreas Gerstlauer,et al.  Transforming a linear algebra core to an FFT accelerator , 2013, 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors.

[23]  Franz Franchetti,et al.  Memory Bandwidth Efficient Two-Dimensional Fast Fourier Transform Algorithm and Implementation for Large Problem Sizes , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[24]  Gabriel H. Loh,et al.  3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.

[25]  Franz Franchetti,et al.  Computer Generation of Hardware for Linear Digital Signal Processing Transforms , 2012, TODE.

[26]  Franz Franchetti,et al.  FFTS with near-optimal memory access through block data layouts , 2014, ICASSP.

[27]  Jung Ho Ahn,et al.  CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory , 2012, 2012 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[28]  Franz Franchetti,et al.  Data reorganization in memory using 3D-stacked DRAM , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).