Paper Information - MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores

Abstract

The CUDA programming model, which is based on an extended ANSI C language and a runtime environment, allows the programmer to specify explicitly data parallel computation. NVIDIA developed CUDA to open the architecture of their graphics accelerators to more general applications, but did not provide an efficient mapping to execute the programming model on any other architecture.

This document describes Multicore-CUDA (MCUDA), a system that efficiently maps the CUDA programming model to a multicore CPU architecture. The major contribution of this work is the source-to-source translation process that converts CUDA code into standard C that interfaces to a runtime library for parallel execution. We apply the MCUDA framework to some CUDA applications previously shown to have high performance on a GPU, and demonstrate high efficiency executing these applications on a multicore CPU architecture. The thread-level parallelism, data locality and computational regularity of the code as expressed in the CUDA model achieve much of the benefit of hand-tuning an application for the CPU architecture. With the MCUDA framework, it is now possible to write data-parallel code in a single programming model for efficient execution on CPU or GPU architectures.
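To make the idea of translating a CUDA kernel into standard C more concrete, the sketch below shows one way such a translation could look for a simple vector addition. It is an illustration only, not output of the MCUDA translator: the names `dim3_t`, `vec_add_block`, and `vec_add_launch` are hypothetical, and the approach of wrapping the kernel body in an explicit loop over thread indices is an assumption that the abstract itself does not spell out. The launch function runs blocks serially, whereas a real runtime library would presumably distribute blocks across CPU worker threads.

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in dim3 / index types. */
typedef struct { unsigned x, y, z; } dim3_t;

/*
 * CUDA kernel being illustrated, for reference:
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* One thread block executed by a single CPU thread: the implicit CUDA
 * threads of the block become an explicit loop over thread_idx.x. */
void vec_add_block(const float *a, const float *b, float *c, int n,
                   dim3_t block_idx, dim3_t block_dim)
{
    dim3_t thread_idx = {0, 0, 0};
    for (thread_idx.x = 0; thread_idx.x < block_dim.x; thread_idx.x++) {
        int i = (int)(block_idx.x * block_dim.x + thread_idx.x);
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

/* Stand-in for a kernel launch. Blocks run one after another here;
 * this serial loop is a simplification, not a parallel runtime. */
void vec_add_launch(const float *a, const float *b, float *c, int n,
                    dim3_t grid_dim, dim3_t block_dim)
{
    assert(grid_dim.x > 0 && block_dim.x > 0);
    dim3_t block_idx = {0, 0, 0};
    for (block_idx.x = 0; block_idx.x < grid_dim.x; block_idx.x++)
        vec_add_block(a, b, c, n, block_idx, block_dim);
}
```

Calling `vec_add_launch` with a grid of 2 blocks of 4 threads on 5-element arrays covers every element, with the `i < n` guard skipping the three surplus thread indices, just as it would in the CUDA version.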

Sam S. Stone | John A. Stratton | Wen-mei W. Hwu
