MCUDA: An Efficient Implementation of CUDA Kernels on Multi-cores

Abstract

The CUDA programming model, which is based on an extended ANSI C language and a runtime environment, allows the programmer to specify explicitly data parallel computation. NVIDIA developed CUDA to open the architecture of their graphics accelerators to more general applications, but did not provide an efficient mapping to execute the programming model on any other architecture.

This document describes Multicore-CUDA (MCUDA), a system that efficiently maps the CUDA programming model to a multicore CPU architecture. The major contribution of this work is the source-to-source translation process that converts CUDA code into standard C that interfaces to a runtime library for parallel execution. We apply the MCUDA framework to some CUDA applications previously shown to have high performance on a GPU, and demonstrate high efficiency executing these applications on a multicore CPU architecture. The thread-level parallelism, data locality and computational regularity of the code as expressed in the CUDA model achieve much of the benefit of hand-tuning an application for the CPU architecture. With the MCUDA framework, it is now possible to write data-parallel code in a single programming model for efficient execution on CPU or GPU architectures.
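
As a rough illustration of what converting a CUDA kernel into standard C can look like, the sketch below serializes the threads of one block into an explicit loop and iterates over the blocks of the grid. It is an illustrative assumption, not the output of the MCUDA translator: the `dim1` type and the `vec_add_block` and `vec_add_launch` functions are hypothetical names, and the block loop is run sequentially here where a runtime library could instead distribute blocks across CPU worker threads.

```c
/* Hypothetical stand-in for CUDA's built-in index types (one dimension only). */
typedef struct { unsigned x; } dim1;

/* Original CUDA kernel, shown for reference:
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* One thread block executed by a single CPU thread: the implicit
 * per-thread parallelism of the kernel becomes an explicit loop over
 * threadIdx. The kernel body itself is unchanged. */
void vec_add_block(const float *a, const float *b, float *c, int n,
                   dim1 blockIdx, dim1 blockDim)
{
    dim1 threadIdx;
    for (threadIdx.x = 0; threadIdx.x < blockDim.x; threadIdx.x++) {
        int i = (int)(blockIdx.x * blockDim.x + threadIdx.x);
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

/* Stand-in for a kernel launch: visit every block of the grid.
 * Blocks are independent, so this loop is where a runtime library
 * could hand blocks to different CPU cores (assumption; run
 * sequentially here to keep the sketch self-contained). */
void vec_add_launch(const float *a, const float *b, float *c, int n,
                    unsigned grid_dim, unsigned block_dim)
{
    dim1 blockIdx;
    dim1 blockDim = { block_dim };
    for (blockIdx.x = 0; blockIdx.x < grid_dim; blockIdx.x++)
        vec_add_block(a, b, c, n, blockIdx, blockDim);
}
```

Calling `vec_add_launch(a, b, c, n, grid_dim, block_dim)` plays the role of the CUDA launch `vec_add<<<grid_dim, block_dim>>>(a, b, c, n)`. This simple kernel has no synchronization between threads; a kernel that uses barriers would need more than a single loop around its body.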