Paper information - Evaluating cache coherent shared virtual memory for heterogeneous multicore chips

Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current heterogeneous chip. In this paper, we present a CCSVM design for a CPU/GPU chip, as well as an extension of the pthreads programming model, called xthreads, for programming this heterogeneous multicore chip (HMC). We experimentally compare CCSVM/xthreads to a state-of-the-art CPU/GPU chip from AMD that runs OpenCL software. CCSVM's more efficient communication enables far better performance and far fewer DRAM accesses.

Daniel J. Sorin | Blake A. Hechtman
