Benchmarking Data and Compute Intensive Applications on Modern CPU and GPU Architectures

Abstract The use of graphics hardware for non-graphics applications has become popular among scientific programmers and researchers, as its theoretical performance has increased at a higher rate than that of CPUs in recent years. However, these performance gains may easily be lost in the context of a specific parallel application due to various hardware and software factors. Consequently, software benchmarks and performance testing are still the best techniques for comparing the efficiency of emerging parallel architectures with built-in support for parallelism at different levels. Unfortunately, many available benchmarks are relatively simple application kernels, have been optimized only for a certain parallel architecture, or do not take advantage of recent capabilities provided by modern hardware and low-level APIs. Thus, the main aim of this paper is to present a comprehensive analysis of the real performance of selected applications that follow JPEG 2000, a complex standard for data compression and coding. The standard consists of a chain of data- and compute-intensive tasks that can be treated as good examples of software benchmarks for modern parallel hardware architectures. In this paper we compare the performance results achieved by our standard-based benchmarks, executed on selected architectures for different data sets, to identify possible bottlenecks. We also discuss best practices and advice for parallel software development to help users evaluate in advance, and then select, appropriate solutions to accelerate the execution of their applications.
