Improving the global memory efficiency in GPU-based systems
[1] Thomas B. Jablin,et al. Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes , 2015, ICS.
[2] Hai Jiang,et al. MGMR: Multi-GPU Based MapReduce , 2013, GPC.
[3] Erik Hagersten,et al. Building Heterogeneous Unified Virtual Memories (UVMs) without the Overhead , 2016, ACM Trans. Archit. Code Optim..
[4] David R. Kaeli,et al. Data Structures and Transformations for Physically Based Simulation on a GPU , 2010, VECPAR.
[5] Bradford M. Beckmann,et al. Software Assisted Hardware Cache Coherence for Heterogeneous Processors , 2016, MEMSYS.
[6] Thomas F. Wenisch,et al. Selective GPU caches to eliminate CPU-GPU HW cache coherence , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[7] Mark Horowitz,et al. Energy dissipation in general purpose microprocessors , 1996, IEEE J. Solid State Circuits.
[8] George Kurian,et al. ATAC: Improving performance and programmability with on-chip optical networks , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.
[9] Tao Li,et al. Exploring Silicon Nanophotonics in Throughput Architecture , 2014, IEEE Design & Test.
[10] Jürgen Schmidhuber,et al. Multi-column deep neural networks for image classification , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[11] David Brooks,et al. Architectural power models for SRAM and CAM structures based on hybrid analytical/empirical techniques , 2007, ICCAD 2007.
[12] Jeffrey S. Vetter,et al. Quantifying NUMA and contention effects in multi-GPU systems , 2011, GPGPU-4.
[13] I Aguilar,et al. Breeding and Genetics Symposium: really big data: processing and analysis of very large data sets. , 2012, Journal of animal science.
[14] Gil Neiger,et al. Intel® Virtualization Technology for Directed I/O , 2006 .
[15] Mike Houston,et al. A closer look at GPUs , 2008, Commun. ACM.
[16] Cary Gunn,et al. CMOS Photonics for High-Speed Interconnects , 2006, IEEE Micro.
[17] Sarita V. Adve,et al. Efficient GPU synchronization without scopes: Saying no to complex consistency models , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[18] Henry Hoffmann,et al. On-Chip Interconnection Architecture of the Tile Processor , 2007, IEEE Micro.
[19] Chung-Ta King,et al. Traffic-aware frequency scaling for balanced on-chip networks on GPGPUs , 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
[20] Babak Falsafi,et al. NOC-Out: Microarchitecting a Scale-Out Processor , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[21] David H. Albonesi,et al. Phastlane: a rapid transit optical routing network , 2009, ISCA '09.
[22] Natalie D. Enright Jerger,et al. NoC Architectures for Silicon Interposer Systems: Why Pay for more Wires when you Can Get them (from your interposer) for Free? , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[23] John Kim,et al. Providing cost-effective on-chip network bandwidth in GPGPUs , 2012, 2012 IEEE 30th International Conference on Computer Design (ICCD).
[24] William J. Dally,et al. Flattened butterfly: a cost-efficient topology for high-radix networks , 2007, ISCA '07.
[25] Jie Sun,et al. Nanophotonic integration in state-of-the-art CMOS foundries. , 2011, Optics express.
[26] John E. Stone,et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.
[27] Bongjae Kim,et al. Acceleration of Computational Fluid Dynamics Analysis by using Multiple GPUs , 2016 .
[28] Scott A. Mahlke,et al. SKMD: Single Kernel on Multiple Devices for Transparent CPU-GPU Collaboration , 2015, ACM Trans. Comput. Syst..
[29] Avinash Kodi,et al. Energy-efficient optical Network-on-Chip architecture for heterogeneous multicores , 2016, 2016 IEEE Optical Interconnects Conference (OI).
[30] William J. Dally. Virtual-channel flow control , 1990, ISCA '90.
[31] Xavier Llorà,et al. Large‐scale data mining using genetics‐based machine learning , 2013, GECCO.
[32] Sang Lyul Min,et al. U-cache: a cost-effective solution to synonym problem , 1995, Proceedings of 1995 1st IEEE Symposium on High Performance Computer Architecture.
[33] Bingsheng He,et al. gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space , 2016, USENIX Annual Technical Conference.
[34] David R. Kaeli,et al. Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[35] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[36] Babak Falsafi,et al. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[37] Sudhakar Yalamanchili,et al. Interconnection Networks: An Engineering Approach , 2002 .
[38] Lieven Eeckhout,et al. A low-cost conflict-free NoC for GPGPUs , 2016, 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC).
[39] Norman P. Jouppi,et al. Combining memory and a controller with photonics through 3D-stacking to enable scalable and energy-efficient systems , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[40] Amnon Barak,et al. Memory access patterns: the missing piece of the multi-GPU puzzle , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[41] Stefanos Kaxiras,et al. Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[42] David A. Wood,et al. Heterogeneous-race-free memory models , 2014, ASPLOS.
[43] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[44] J. Thomas Pawlowski,et al. Hybrid memory cube (HMC) , 2011, 2011 IEEE Hot Chips 23 Symposium (HCS).
[45] Robert Schreiber,et al. Transient Simulation of Nonlinear Electro-Quasi-Static Field Problems Accelerated by Multiple GPUs , 2016, IEEE Transactions on Magnetics.
[46] Simon W. Moore,et al. Low-latency virtual-channel routers for on-chip networks , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..
[47] William J. Dally,et al. Digital systems engineering , 1998 .
[48] Jungwon Kim,et al. Achieving a single compute device image in OpenCL for multiple GPUs , 2011, PPoPP '11.
[49] Anantha Chandrakasan,et al. Approaching the theoretical limits of a mesh NoC with a 16-node chip prototype in 45nm SOI , 2012, DAC Design Automation Conference 2012.
[50] Tao Li,et al. Integrating nanophotonics in GPU microarchitecture , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[51] William J. Dally,et al. Deadlock-Free Adaptive Routing in Multicomputer Networks Using Virtual Channels , 1993, IEEE Trans. Parallel Distributed Syst..
[52] John Kim,et al. Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[53] Zhaolin Li,et al. A power-efficient network-on-chip for multi-core stream processors , 2013, 2013 IEEE 10th International Conference on ASIC.
[54] Justin Schauer,et al. High Speed and Low Energy Capacitively Driven On-Chip Wires , 2008, IEEE Journal of Solid-State Circuits.
[55] Vladimir Stojanovic,et al. Designing Energy-Efficient Low-Diameter On-Chip Networks with Equalized Interconnects , 2009, 2009 17th IEEE Symposium on High Performance Interconnects.
[56] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[57] Xi Chen,et al. Iris: A hybrid nanophotonic network design for high-performance and low-power on-chip communication , 2011, JETC.
[58] Gabriel H. Loh,et al. Extending the effectiveness of 3D-stacked DRAM caches with an adaptive multi-queue policy , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[59] Francisco Tirado,et al. NMF-mGPU: non-negative matrix factorization on multi-GPU systems , 2015, BMC Bioinformatics.
[60] Luca P. Carloni,et al. On the Design of a Photonic Network-on-Chip , 2007, First International Symposium on Networks-on-Chip (NOCS'07).
[61] Xiangyu Li,et al. Hetero-mark, a benchmark suite for CPU-GPU collaborative computing , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[62] G. Sohi,et al. A static power model for architects , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.
[63] Babak Falsafi,et al. Die-stacked DRAM caches for servers: hit ratio, latency, or bandwidth? have it all with footprint cache , 2013, ISCA.
[64] Mark D. Hill,et al. Efficiently enabling conventional block sizes for very large die-stacked DRAM caches , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[65] Qi-Jun Zhang,et al. Parallel back-propagation neural network training technique using CUDA on multiple GPUs , 2015, 2015 IEEE MTT-S International Conference on Numerical Electromagnetic and Multiphysics Modeling and Optimization (NEMO).
[66] David Kaeli,et al. The Multi2Sim Simulation Framework: A CPU-GPU Model for Heterogeneous Computing , 2011 .
[67] Sarita V. Adve,et al. DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[68] Tor M. Aamodt,et al. Complexity effective memory access scheduling for many-core accelerator architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[69] Wei Zhang,et al. A low-power fat tree-based optical Network-On-Chip for multiprocessor system-on-chip , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.
[70] Alex Ramírez,et al. Designing Efficient Heterogeneous Memory Architectures , 2015, IEEE Micro.
[71] Murray Cole,et al. Autotuning Wavefront Applications for Multicore Multi-GPU Hybrid Architectures , 2014, PMAM.
[72] Vladimir Stojanovic,et al. A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90nm CMOS , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.
[73] Mark D. Hill,et al. Supporting Very Large DRAM Caches with Compound-Access Scheduling and MissMap , 2012, IEEE Micro.
[74] José F. Martínez,et al. A power-efficient all-optical on-chip interconnect using wavelength-based oblivious routing , 2010, ASPLOS XV.
[75] Yan Solihin,et al. CHOP: Adaptive filter-based DRAM caching for CMP server platforms , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[76] Gabriel H. Loh,et al. Fundamental Latency Trade-off in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags with a Simple and Practical Design , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[77] Wei Zhang,et al. A Torus-Based Hierarchical Optical-Electronic Network-on-Chip for Multiprocessor System-on-Chip , 2012, JETC.
[78] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.
[79] Mateo Valero,et al. Bandwidth of Crossbar and Multiple-Bus Connections for Multiprocessors , 1982, IEEE Transactions on Computers.
[80] Niraj K. Jha,et al. Token flow control , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[81] Marco Ajmone Marsan,et al. Markov Models for Multiple Bus Multiprocessor Systems , 1982, IEEE Transactions on Computers.
[82] Onur Mutlu,et al. Kilo-NOC: A heterogeneous network-on-chip architecture for scalability and service guarantees , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[83] Norman P. Jouppi,et al. CACTI 3.0: an integrated cache timing, power, and area model , 2001 .
[84] David R. Kaeli,et al. Asymmetric NoC Architectures for GPU Systems , 2015, NOCS.
[85] Wen-mei W. Hwu,et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA , 2008, PPoPP.
[86] S. Borkar,et al. An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS , 2008, IEEE Journal of Solid-State Circuits.
[87] Jinchun Kim,et al. Bandwidth-efficient on-chip interconnect designs for GPGPUs , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).
[88] John Ayer,et al. Understanding Performance of PCI Express Systems , 2008 .
[89] Natalie D. Enright Jerger,et al. Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.
[90] Chen Sun,et al. A 1.23pJ/b 2.5Gb/s monolithically integrated optical carrier-injection ring modulator and all-digital driver circuit in commercial 45nm SOI , 2013, 2013 IEEE International Solid-State Circuits Conference Digest of Technical Papers.
[91] John Kim,et al. Multi-GPU System Design with Memory Networks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[92] Chung-Ta King,et al. Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs , 2014, NPC.
[93] Sebastian Schöps,et al. Multi-GPU Acceleration of Algebraic Multigrid Preconditioners , 2016 .
[94] David A. Wood,et al. Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[95] Jaeha Kim,et al. Memory-centric system interconnect design with Hybrid Memory Cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[96] Kevin Skadron,et al. HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects , 2003 .
[97] Jaejin Lee,et al. A 1.2 V 8 Gb 8-Channel 128 GB/s High-Bandwidth Memory (HBM) Stacked DRAM With Effective I/O Test Circuits , 2015, IEEE Journal of Solid-State Circuits.
[98] Niraj K. Jha,et al. Express virtual channels: towards the ideal interconnection fabric , 2007, ISCA '07.
[99] Mike Mantor,et al. AMD Radeon™ HD 7970 with graphics core next (GCN) architecture , 2012, 2012 IEEE Hot Chips 24 Symposium (HCS).
[100] Christopher Batten,et al. Silicon-photonic clos networks for global on-chip communication , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.
[101] Sarita V. Adve,et al. Stash: Have your scratchpad and cache it too , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[102] David R. Kaeli,et al. Exploring the multiple-GPU design space , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[103] William J. Dally,et al. A delay model and speculative architecture for pipelined routers , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.
[104] Sudhakar Yalamanchili,et al. Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures , 2013, ACM Trans. Design Autom. Electr. Syst..
[105] Yu Zhang,et al. Firefly: illuminating future network-on-chip with nanophotonics , 2009, ISCA '09.
[106] David R. Kaeli,et al. Leveraging Silicon-Photonic NoC for Designing Scalable GPUs , 2015, ICS.
[107] Stefanos Kaxiras,et al. Complexity-effective multicore coherence , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[108] Mikko H. Lipasti,et al. Light speed arbitration and flow control for nanophotonic interconnects , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[109] Klaus Schulten,et al. Fast Visualization of Gaussian Density Surfaces for Molecular Dynamics and Particle System Trajectories , 2012, EuroVis.
[110] Thomas F. Wenisch,et al. Unlocking bandwidth for GPUs in CC-NUMA systems , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[111] K. Saban. Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency , 2009 .
[112] Shaahin Hessabi,et al. All-Optical Wavelength-Routed Architecture for a Power-Efficient Network on Chip , 2014, IEEE Transactions on Computers.
[113] Xiang Zhang,et al. A multilayer nanophotonic interconnection network for on-chip many-core communications , 2010, Design Automation Conference.
[114] Laxmi N. Bhuyan,et al. A dynamic cache sub-block design to reduce false sharing , 1995, Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors.
[115] Jaewon Lee,et al. GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[116] William J. Dally,et al. Microarchitecture of a high radix router , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[117] Christopher Batten,et al. Building Manycore Processor-to-DRAM Networks with Monolithic Silicon Photonics , 2008, 2008 16th IEEE Symposium on High Performance Interconnects.
[118] Chao Chen,et al. Runtime Management of Laser Power in Silicon-Photonic Multibus NoC Architecture , 2013, IEEE Journal of Selected Topics in Quantum Electronics.
[119] D. Drew. Mark , 2005, Neonatal Network.
[120] Simon See,et al. An Evaluation of Unified Memory Technology on NVIDIA GPUs , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.
[121] Sharad Malik,et al. Power-driven design of router microarchitectures in on-chip networks , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..
[122] P. Cochat,et al. Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.
[123] Holger Fröning,et al. Exploring LLVM Infrastructure for Simplified Multi-GPU Programming , 2016 .
[124] Alyssa B. Apsel,et al. Leveraging Optical Technology in Future Bus-based Chip Multiprocessors , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[125] David Kaeli,et al. Heterogeneous Computing with OpenCL 2.0 , 2015 .
[126] Eisse Mensink,et al. A 0.28pJ/b 2Gb/s/ch Transceiver in 90nm CMOS for 10mm On-Chip interconnects , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.
[127] Mikko H. Lipasti,et al. Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[128] Sudhakar Yalamanchili,et al. Design Space Exploration of On-chip Ring Interconnection for a CPU-GPU Architecture , 2012 .
[129] Sharad Malik,et al. A Power Model for Routers: Modeling Alpha 21364 and InfiniBand Routers , 2003, IEEE Micro.
[130] Ahmed Louri,et al. Dynamic Reconfiguration of 3D Photonic Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[131] Sarita V. Adve,et al. Weak ordering: a new definition , 1990 .
[132] Stephen W. Keckler,et al. Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.
[133] Rajat Raina,et al. Large-scale deep unsupervised learning using graphics processors , 2009, ICML '09.
[134] Roy D. Sleator,et al. 'Big data', Hadoop and cloud computing in genomics , 2013, J. Biomed. Informatics.
[135] David R. Kaeli,et al. UMH , 2016, ACM Trans. Archit. Code Optim..
[136] Jaewon Lee,et al. ScaleGPU: GPU Architecture for Memory-Unaware GPU Programming , 2014, IEEE Computer Architecture Letters.
[137] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[138] Chita R. Das,et al. A case for heterogeneous on-chip interconnects for CMPs , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[139] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1969 .
[140] Michal Lipson,et al. Optical 4x4 hitless Silicon router for optical Networks-on-Chip (NoC): erratum , 2008 .
[141] Jung Ho Ahn,et al. Corona: System Implications of Emerging Nanophotonic Technology , 2008, 2008 International Symposium on Computer Architecture.
[142] Bingsheng He,et al. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture , 2013, Proc. VLDB Endow..
[143] David R. Kaeli,et al. Data transformations enabling loop vectorization on multithreaded data parallel architectures , 2010, PPoPP '10.
[144] Hsien-Hsin S. Lee,et al. An optimized 3D-stacked memory architecture by exploiting excessive, high-density TSV bandwidth , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.