Computing discrete transforms on the Cell Broadband Engine

Discrete transforms are fundamental kernels of primary importance in many computationally intensive scientific applications. In this paper, we investigate the performance of two such algorithms, the Fast Fourier Transform (FFT) and the Discrete Wavelet Transform (DWT), on the Sony-Toshiba-IBM Cell Broadband Engine (Cell/B.E.), a heterogeneous multicore chip architected for intensive gaming applications and high-performance computing. We design an efficient parallel implementation of the FFT that fully exploits the architectural features of the Cell/B.E. Our FFT algorithm uses an iterative out-of-place approach and, for 1K to 16K complex input samples, outperforms all other parallel implementations of FFT on the Cell/B.E., including FFTW. Our FFT implementation obtains a single-precision performance of 18.6 GFLOP/s on the Cell/B.E., outperforming Intel Duo Core (Woodcrest) for inputs of more than 2K samples. We also optimize the DWT in the context of JPEG2000 for the Cell/B.E. DWT has abundant parallelism; however, due to the low temporal locality of the algorithm, memory bandwidth becomes a significant bottleneck in achieving high performance. We introduce a novel data decomposition scheme to achieve highly efficient DMA data transfer and vectorization with low programming complexity. We also merge multiple steps of the algorithm to reduce the bandwidth requirement, which significantly enhances the scalability of the implementation. Using one Cell/B.E. chip, our optimized DWT implementation achieves speedups of 34 and 56 times over the baseline code for the lossless and lossy transforms, respectively. We also compare performance with the AMD Barcelona (Quad-core Opteron) processor, which the Cell/B.E. outperforms. This highlights the advantage of the Cell/B.E. over general-purpose multicore processors in processing regular and bandwidth-intensive scientific applications.
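To make the phrase "iterative out-of-place approach" concrete, the following is a minimal sketch of one well-known formulation of that idea, the radix-2 Stockham autosort FFT. It is an illustration only: the abstract does not say which out-of-place variant the authors use, and the function name `fft_stockham` and its buffer layout are assumptions made here. The sketch is scalar Python and does not model the Cell/B.E. specifics (SIMD vectorization, DMA transfers, or work distribution across processing elements).

```python
import cmath


def fft_stockham(samples):
    """Radix-2 Stockham autosort FFT (illustrative sketch, not the paper's code).

    Each stage reads from one buffer and writes to the other (out-of-place),
    so no bit-reversal pass is needed and the result is in natural order.
    """
    n = len(samples)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")

    src = [complex(v) for v in samples]  # buffer read in the current stage
    dst = [0j] * n                       # buffer written in the current stage

    length = n   # size of the sub-transforms handled in this stage
    stride = 1   # number of interleaved sub-transforms
    while length > 1:
        half = length // 2
        for p in range(half):
            # Twiddle factor exp(-2*pi*i*p/length) for this butterfly group.
            w = cmath.exp(-2j * cmath.pi * p / length)
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src  # swap roles instead of copying data back
        length = half
        stride *= 2
    return src


if __name__ == "__main__":
    # A unit impulse transforms to a constant spectrum of ones.
    print(fft_stockham([1, 0, 0, 0]))
```

Because every stage streams through the source buffer and writes the destination buffer with regular strides, this style of FFT tends to suit architectures where data movement is explicit, which is one plausible reason to prefer it over an in-place formulation on such hardware.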
