Performance Optimization of Multithreaded 2D Fast Fourier Transform on Multicore Processors Using Load Imbalancing Parallel Computing Method

Fast Fourier transform (FFT) is a key routine in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on modern multicore platforms is therefore of paramount concern to the high-performance computing community. The inherent complexities of these platforms, such as severe resource contention and non-uniform memory access, however, pose formidable challenges. We study the performance profiles of multithreaded 2D FFTs provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel Math Kernel Library (Intel MKL) FFT, on a modern Intel Haswell multicore processor consisting of 36 cores. We show that all three routines exhibit drastic performance variations and that, as a result, their average performance is considerably lower than their peak performance. The ratios of average-to-peak performance for the 2D FFT routines from the three packages are 40%, 30%, and 24%. We conclude that improving the average performance of 2D FFT on modern multicore processors by removing these performance variations constitutes a tremendous research challenge. To address this challenge, we propose two novel optimization methods, PFFT-FPM and PFFT-FPM-PAD, specifically designed and implemented for 2D FFT. The methods employ model-based parallel computing using a load-imbalancing technique. They take as inputs the discrete 3D functions of the performance of the processors against problem size, compute the 2D DFT of a complex signal matrix of size $N \times N$ using $p$ abstract processors, and output the transformed signal matrix.
Based on our experiments on a modern Intel Haswell multicore server consisting of 36 physical cores, the average and maximum speedups observed for PFFT-FPM using FFTW-3.3.7 are $1.9\times$ and $6.8\times$, and the average and maximum speedups observed using Intel MKL FFT are $1.3\times$ and $2\times$. The average and maximum speedups observed for PFFT-FPM-PAD using FFTW-3.3.7 are $2\times$ and $9.4\times$, and the average and maximum speedups observed using Intel MKL FFT are $1.4\times$ and $5.9\times$.
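
The load-imbalancing idea can be illustrated with a minimal sketch. It is not the PFFT-FPM implementation: the function names, the brute-force search, and the timing tables are hypothetical and chosen only for illustration, and a naive DFT stands in for an optimized FFT routine. The sketch assumes each abstract processor has a discrete table of execution times against the number of rows it is given, picks the row distribution that minimizes the slowest processor's time (which need not be the even split), and then computes the 2D DFT by row transforms, a transpose, row transforms again, and a final transpose.

```python
import cmath
from itertools import product

def dft(row):
    """Naive O(n^2) 1D DFT; a stand-in for an optimized FFT call."""
    n = len(row)
    return [sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def choose_distribution(time_tables, n):
    """Brute-force search (illustrative only) for the row counts, one per
    abstract processor, that sum to n and minimize the slowest processor's time.
    time_tables[i][x] is a hypothetical measured time for processor i to
    transform x rows of length n."""
    p = len(time_tables)
    best, best_time = None, float("inf")
    for d in product(range(n + 1), repeat=p):
        if sum(d) != n:
            continue
        t = max(time_tables[i][d[i]] for i in range(p))
        if t < best_time:
            best, best_time = d, t
    return best, best_time

def row_dfts(matrix, distribution):
    """Transform every row; rows are grouped into contiguous blocks whose
    sizes follow the distribution. In a real code each block would be handled
    by its own group of threads; here the blocks run one after another."""
    out, start = [], 0
    for rows in distribution:
        out.extend(dft(r) for r in matrix[start:start + rows])
        start += rows
    return out

def dft2_partitioned(matrix, distribution):
    """2D DFT of a square matrix via row DFTs, transpose, row DFTs, transpose."""
    step = row_dfts(matrix, distribution)
    step = row_dfts(transpose(step), distribution)
    return transpose(step)

# Hypothetical timing tables for N = 4 rows and p = 2 abstract processors.
# Both tables have a slow spot at x = 2 rows, so the even split (2, 2) loses.
N = 4
time_tables = [
    [0.0, 1.0, 4.0, 4.5, 8.0],
    [0.0, 1.0, 4.0, 3.0, 8.0],
]
distribution, predicted_time = choose_distribution(time_tables, N)
signal = [[complex(r + c, r - c) for c in range(N)] for r in range(N)]
spectrum = dft2_partitioned(signal, distribution)
print(distribution, predicted_time)
```

With these made-up tables the search selects the uneven distribution over the balanced one because the balanced point falls on a performance dip. The exhaustive search is exponential in the number of processors and is only meant to make the objective concrete.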
