Novel Model-based Methods for Performance Optimization of Multithreaded 2D Discrete Fourier Transform on Multicore Processors

In this paper, we use the multithreaded fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to present a novel model-based parallel computing technique as an effective and portable method for optimizing the performance of multithreaded scientific routines, especially in the current multicore era, where processors have an abundant number of cores. We propose two optimization methods based on this technique, PFFT-FPM and PFFT-FPM-PAD. Both compute the 2D-DFT of a complex signal matrix of size NxN using p abstract processors. Both take as input discrete 3D functions of the processors' performance against problem size, and output the transformed signal matrix. In our experiments on a modern Intel Haswell multicore server with 36 physical cores, the average and maximum speedups observed for PFFT-FPM are 1.9x and 6.8x using FFTW-3.3.7, and 1.3x and 2x using Intel MKL FFT. For PFFT-FPM-PAD, the average and maximum speedups are 2x and 9.4x using FFTW-3.3.7, and 1.4x and 5.9x using Intel MKL FFT.
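
The abstract does not spell out the algorithms, so the following is only a minimal sketch of the general idea of computing a 2D-DFT of an NxN matrix with p abstract processors. It assumes a row-column decomposition (1D transforms over rows, transpose, 1D transforms over rows again, transpose) and a row partition proportional to one constant speed per processor. Both are simplifying assumptions: the methods described above take performance functions of problem size as input rather than constant speeds. The names `partition_rows`, `row_dfts_partitioned`, and `dft_2d` are hypothetical, and the abstract processors are simulated sequentially with a naive O(n^2) 1D DFT standing in for the FFTW or Intel MKL routines.

```python
import cmath


def dft_1d(row):
    """Naive 1D DFT of one row (stand-in for a library FFT call)."""
    n = len(row)
    return [
        sum(row[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def partition_rows(n, speeds):
    """Split n rows among processors in proportion to constant speeds.

    Assumption: one speed per abstract processor. This is a simplification,
    not the model-based partitioning the paper proposes.
    """
    total = sum(speeds)
    counts = [int(n * s / total) for s in speeds]
    # Give rows lost to rounding to the fastest processors first.
    leftover = n - sum(counts)
    fastest_first = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in range(leftover):
        counts[fastest_first[i % len(speeds)]] += 1
    return counts


def row_dfts_partitioned(matrix, counts):
    """Apply 1D DFTs row by row, one contiguous block per abstract processor.

    The blocks are processed sequentially here; in a real implementation
    each block would go to its own group of threads.
    """
    result = []
    start = 0
    for count in counts:
        block = matrix[start:start + count]
        result.extend(dft_1d(row) for row in block)
        start += count
    return result


def dft_2d(matrix, speeds):
    """Row-column 2D DFT of a square matrix using a partitioned row workload."""
    counts = partition_rows(len(matrix), speeds)
    step1 = row_dfts_partitioned(matrix, counts)  # 1D DFTs over rows
    step2 = transpose(step1)                      # columns become rows
    step3 = row_dfts_partitioned(step2, counts)   # 1D DFTs over former columns
    return transpose(step3)                       # restore orientation
```

For example, `partition_rows(10, [1, 1, 2])` assigns 2, 2, and 6 rows to the three processors, and `dft_2d([[1, 0], [0, 0]], [1, 1])` returns a 2x2 matrix whose entries are all 1, as expected for the transform of a unit impulse.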
