A performance study of general-purpose applications on graphics processors using CUDA

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidth. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups over contemporary general-purpose processors (CPUs) for a variety of general-purpose applications. This paper uses NVIDIA's C-like CUDA language and an engineering sample of its recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with the multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model, along with some desirable features that might allow for greater ease of use and more readily support a larger body of applications.
