Paper information - Improving the global memory efficiency in GPU-based systems

Improving the global memory efficiency in GPU-based systems

Abstract of the Dissertation: Improving the Global Memory Efficiency in GPU-Based Systems, by Amir Kavyan Ziabari. Doctor of Philosophy in Electrical and Computer Engineering, Northeastern University, December 2016. Dr. David Kaeli, Adviser.

Graphics Processing Units (GPUs) have been used in a wide range of high-performance computing domains. Unfortunately, computing with GPU devices presents its own challenges, including inefficiencies in the global memory system. With today’s growing demand for Big Data processing, leveraging larger-scale GPUs or multiple GPUs becomes the natural next step. Big Data applications magnify the current limitations of global memory on GPU-based systems. A major source of this global memory inefficiency is bottlenecks in the on-chip network associated with this memory. In this dissertation, we describe how to optimize the performance and power efficiency of an on-chip network used on a GPU. We explore the GPU-based Network-on-Chip (NoC) design space, develop execution-driven simulation models, and analyze a range of parallelized applications. We evaluate a number of conventional network topologies and their impact on the performance of a GPU system. We use detailed simulation to characterize the memory access patterns present in GPU applications, and explore electrical on-chip networks that best match the needs of these applications. We incorporate asymmetry into the NoC design as a solution to reduce the power consumption of a network, while providing comparable performance to the best conventional topology. Our solution reduces the Energy-Delay Product (EDP) by as much as 88%. In order to improve the performance of current and future GPUs, we explore the use of silicon-photonic link technology when constructing the NoC. This emerging, low-latency, high-bandwidth technology has been incorporated in chip multiprocessors (CMPs). 
By introducing a hybrid silicon-photonic NoC in the GPU memory system, we are able to improve performance of memory-intensive applications by 3.43×, as compared with the best alternative electrical NoC. Finally, we conduct a thorough analysis of global memory management schemes for multi-GPU systems. We identify limitations of the global memory present in previously proposed

Amir Kavyan Ziabari
