Redefining the Role of the CPU in the Era of CPU-GPU Integration

GPUs have been rapidly adopted as general-purpose computing engines in recent years, driven by their high computational throughput and energy efficiency. CPUs and GPUs are also becoming more tightly integrated, including GPUs that sit on the same die as the CPU, which further lowers the barriers to offloading work from the CPU to the GPU. Much effort has gone into adapting GPU designs to this new partitioning of the computation space, including better programming models and more general processing units with support for control flow. However, researchers have paid little attention to the CPU and how it must adapt to this change. This article demonstrates that the coming era of CPU-GPU integration requires us to rethink the CPU's design and architecture. We show that the code the CPU will run, once appropriate computations are mapped to the GPU, has significantly different characteristics from the original code (which previously would have been mapped entirely to the CPU).
