Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

The codelet model is a fine-grain, dataflow-inspired program execution model that balances parallelism against runtime-system overhead. It plays an important role in performance, scalability, and energy efficiency in exascale studies such as the DARPA UHPC project and the DOE X-Stack project. As an important application, the Fast Fourier Transform (FFT) has been studied extensively in fine-grain models, including the codelet model. However, existing work focuses on how fine-grain models achieve a more balanced workload compared to traditional coarse-grain models. In this paper, we make the important observation that the flexibility in the execution order of tasks in fine-grain models also improves memory-bandwidth utilization. Using the codelet model and the FFT application as a case study, we show that a proper execution order of tasks (codelets) can significantly reduce memory contention and thus improve performance. We propose an algorithm that heuristically guides the execution order of codelets to reduce memory contention. We implemented our algorithm on the IBM Cyclops-64 architecture. Experimental results show that our algorithm improves performance by up to 46% over a state-of-the-art coarse-grain implementation of the FFT on Cyclops-64.
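To make the central idea concrete, here is a minimal, purely illustrative sketch (not the paper's algorithm, and not the Cyclops-64 runtime): ready codelets are greedily grouped into execution "waves" so that codelets issued together touch distinct memory banks. The codelet names, bank mapping, and worker count are all hypothetical toy data; the point is only that reordering alone, with no change to the work performed, can reduce simultaneous accesses to the same bank.

```python
# Illustrative sketch of bank-aware codelet ordering (toy model, not the
# paper's heuristic). Each codelet is a (name, bank) pair, where `bank`
# is the memory bank its data maps to.
from collections import defaultdict

def schedule_waves(codelets, workers):
    """Greedily group codelets into waves of `workers` tasks,
    preferring codelets whose bank is not yet used in the wave."""
    pending = list(codelets)
    waves = []
    while pending:
        wave, used_banks = [], set()
        # First pass: pick codelets that hit banks not yet in this wave.
        for c in list(pending):
            if len(wave) == workers:
                break
            if c[1] not in used_banks:
                wave.append(c)
                used_banks.add(c[1])
                pending.remove(c)
        # Second pass: fill any remaining slots regardless of bank.
        while len(wave) < workers and pending:
            wave.append(pending.pop(0))
        waves.append(wave)
    return waves

def contention(waves):
    """Count colliding accesses: extra codelets per shared bank per wave."""
    total = 0
    for wave in waves:
        counts = defaultdict(int)
        for _, bank in wave:
            counts[bank] += 1
        total += sum(n - 1 for n in counts.values() if n > 1)
    return total

# Toy example: 8 codelets whose data maps to banks [0,0,1,1,2,2,3,3],
# executed by 4 workers per wave.
codelets = [(f"c{i}", i // 2) for i in range(8)]
naive_waves = [codelets[0:4], codelets[4:8]]       # plain program order
aware_waves = schedule_waves(codelets, workers=4)  # bank-aware order
print(contention(naive_waves), contention(aware_waves))  # prints: 4 0
```

In this toy, program order issues two codelets per bank in each wave (4 collisions total), while the bank-aware order spreads accesses across all four banks and eliminates contention entirely, which is the effect the paper's heuristic aims for at real scale.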
