Benchmarking JPEG 2000 implementations on modern CPU and GPU architectures

Abstract The use of graphics hardware for non-graphics applications has become popular among scientific programmers and researchers, because the theoretical performance of GPUs has grown faster than that of CPUs in recent years. However, these performance gains are easily lost in a specific parallel application due to various hardware and software factors. JPEG 2000 is a complex standard for data compression and coding that provides many advanced capabilities demanded by more specialized applications. Several JPEG 2000 implementations utilize emerging parallel architectures with built-in support for parallelism at different levels. Unfortunately, many available implementations are optimized only for a particular parallel architecture, or they do not take advantage of recent capabilities provided by modern hardware and low-level APIs. Thus, the main aim of this paper is to present a comprehensive analysis of the real performance of JPEG 2000. The standard consists of a chain of data-intensive and compute-intensive tasks that can serve as good software benchmarks for modern parallel hardware architectures. In this paper we compare the performance achieved by various JPEG 2000 implementations executed on selected architectures for different data sets in order to identify possible bottlenecks. We also discuss best practices and advice for parallel software development to help users evaluate candidate solutions in advance and then select appropriate ones to accelerate their applications.