A High-Performance Multi-Element Processing Framework on GPUs

Linh Ha, James King, Zhisong Fu and Robert M. Kirby
Scientific Computing and Imaging Institute, University of Utah
Email: {lha, jsking, zhisong, kirby}@sci.utah.edu

Abstract: Many computational engineering problems, ranging from finite element methods to image processing, involve the batch processing of a large number of data items. While multi-element processing has the potential to harness the computational power of parallel systems, current techniques often concentrate on maximizing elemental performance. Frameworks that take this greedy optimization approach often fail to extract the maximum processing power of the system for multi-element processing problems. By utilizing the knowledge that the same operation will be performed on a large number of items, we can organize the computation to maximize the computational throughput available in parallel streaming hardware. In this paper, we analyze the weaknesses of existing methods and propose efficient parallel programming patterns, implemented in a high-performance multi-element processing framework, to harness the processing power of GPUs. Our approach is capable of leveling out the performance curve even in the range of small element sizes.
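
To make the idea of organizing computation across elements concrete, here is a minimal sketch of one common way to do it. It is not the authors' framework: the interleaved storage layout, the function names (`matvec_per_element`, `interleave`, `matvec_batched`), and the choice of a small dense matrix-vector product as the per-element operation are all assumptions made for illustration, and plain Python stands in for a GPU kernel. The baseline processes one element at a time; the batched version stores entry k of every element contiguously so that the innermost loop runs across elements, which is the loop that would map to neighbouring GPU threads.

```python
def matvec_per_element(mats, vecs, n):
    """Baseline: process one element (an n x n row-major matrix) at a time."""
    out = []
    for A, x in zip(mats, vecs):
        out.append([sum(A[i * n + j] * x[j] for j in range(n)) for i in range(n)])
    return out


def interleave(items, width):
    """Hypothetical layout: entry k of element e is stored at flat[k * batch + e]."""
    batch = len(items)
    flat = [0.0] * (width * batch)
    for e, item in enumerate(items):
        for k in range(width):
            flat[k * batch + e] = item[k]
    return flat


def matvec_batched(A_il, x_il, n, batch):
    """Batched: the same operation is applied to all elements in the inner loop."""
    y_il = [0.0] * (n * batch)
    for i in range(n):
        for j in range(n):
            a_base = (i * n + j) * batch
            x_base = j * batch
            y_base = i * batch
            # On a GPU, e would be the thread index; consecutive e values
            # touch consecutive addresses in all three arrays.
            for e in range(batch):
                y_il[y_base + e] += A_il[a_base + e] * x_il[x_base + e]
    return y_il


# Three 2 x 2 elements, each with its own input vector (made-up data).
n, batch = 2, 3
mats = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 1.0, 0.0], [2.0, 0.0, 0.0, 2.0]]
vecs = [[1.0, 1.0], [5.0, 7.0], [3.0, 4.0]]

ref = matvec_per_element(mats, vecs, n)
y = matvec_batched(interleave(mats, n * n), interleave(vecs, n), n, batch)
result = [[y[i * batch + e] for i in range(n)] for e in range(batch)]
```

Both versions compute the same products; only the loop order and storage layout differ. With very small elements, the per-element form leaves little parallel work inside each element, whereas the batched form exposes the element count itself as the parallel dimension.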
