Understanding the Future of Energy Efficiency in Multi-Module GPUs
Carole-Jean Wu | David W. Nellans | Akhil Arunkumar | Evgeny Bolotin
[1] Scott A. Mahlke, et al. APOGEE: Adaptive prefetching on GPUs for energy efficiency, 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[2] K.M. Wilson, et al. Dynamic Page Placement to Improve Locality in CC-NUMA Multiprocessors for TPC-C, 2001, ACM/IEEE SC 2001 Conference (SC'01).
[3] Shuaiwen Song, et al. A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures, 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[4] Matthew Poremba, et al. Design and Analysis of an APU for Exascale Computing, 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[5] Natalie D. Enright Jerger, et al. Enabling interposer-based disintegration of multi-core processors, 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] Mike O'Connor, et al. Cache-Conscious Wavefront Scheduling, 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[7] Mahmut T. Kandemir, et al. μC-States: Fine-grained GPU datapath power management, 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[8] Carole-Jean Wu, et al. MCM-GPU: Multi-chip-module GPUs for continued performance scalability, 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[9] Karthikeyan Sankaralingam, et al. "Your favorite simulator here" Considered Harmful, 2014.
[10] Xi Chen, et al. A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications, 2013, IEEE Journal of Solid-State Circuits.
[11] Sudhakar Yalamanchili, et al. Power Modeling for GPU Architectures Using McPAT, 2014, TODE.
[12] Indrani Paul, et al. Dynamic GPGPU Power Management Using Adaptive Model Predictive Control, 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[13] Kevin Skadron, et al. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads, 2010, IEEE International Symposium on Workload Characterization (IISWC'10).
[14] Joonyoung Kim, et al. HBM: Memory solution for bandwidth-hungry processors, 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).
[15] Mattan Erez, et al. A locality-aware memory hierarchy for energy-efficient GPU architectures, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[16] Mahmut T. Kandemir, et al. A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[17] Amnon Barak, et al. Memory access patterns: the missing piece of the multi-GPU puzzle, 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.
[18] Xuhao Chen, et al. Adaptive Cache Management for Energy-Efficient GPU Computing, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[19] Daniel A. Jiménez, et al. Adaptive GPU cache bypassing, 2015, GPGPU@PPoPP.
[20] Nam Sung Kim, et al. Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling, 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[21] Mattan Erez, et al. Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures, 2016, ISCA.
[22] Mehrzad Samadi, et al. Memory-centric system interconnect design with hybrid memory cubes, 2013, PACT 2013.
[23] David M. Brooks, et al. Energy characterization and instruction-level energy model of Intel's Xeon Phi processor, 2013, International Symposium on Low Power Electronics and Design (ISLPED).
[24] Onur Mutlu, et al. A case for toggle-aware compression for GPU systems, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[25] Long Chen, et al. Dynamic load balancing on single- and multi-GPU systems, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[26] Gokcen Kestor, et al. Quantifying the energy cost of data movement in scientific applications, 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).
[27] Long Chen, et al. Exploring Fine-Grained Task-Based Execution on Multi-GPU Systems, 2011, 2011 IEEE International Conference on Cluster Computing.
[28] John D. Owens, et al. Multi-GPU MapReduce on GPU Clusters, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[29] Vivien Quéma, et al. Traffic management: a holistic approach to memory placement on NUMA systems, 2013, ASPLOS '13.
[30] George Karypis, et al. Introduction to Parallel Computing, 1994.
[31] Scott A. Mahlke, et al. Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[32] Sudhakar Yalamanchili, et al. Coordinated energy management in heterogeneous processors, 2014, Sci. Program.
[33] William J. Dally, et al. Energy-efficient mechanisms for managing thread context in throughput processors, 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[34] Carole-Jean Wu, et al. ID-cache: instruction and memory divergence based cache management for GPUs, 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[35] Wu-chun Feng, et al. Measuring and modeling on-chip interconnect power on real hardware, 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[36] Jung Ho Ahn, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[37] Aamer Jaleel, et al. Beyond the Socket: NUMA-Aware GPUs, 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[38] Nuno Roma, et al. GPGPU Power Modeling for Multi-domain Voltage-Frequency Scaling, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[39] Mohammad Abdel-Majeed, et al. Warped gates: Gating aware scheduling and power gating for GPGPUs, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[40] G.E. Moore, et al. Cramming More Components Onto Integrated Circuits, 1998, Proceedings of the IEEE.
[41] Henry Hoffmann, et al. GRAPE: Minimizing energy for GPU applications with performance requirements, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[42] Carole-Jean Wu, et al. Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs, 2016, 2016 IEEE 34th International Conference on Computer Design (ICCD).
[43] Hiroshi Sasaki, et al. Power and Performance Characterization and Modeling of GPU-Accelerated Systems, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[44] Won Woo Ro, et al. Warped-Compression: Enabling power efficient GPUs through register compression, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[45] William J. Dally, et al. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems, 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[46] Derek Chiou, et al. GPGPU performance and power estimation using machine learning, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[47] Nam Sung Kim, et al. GPUWattch: enabling energy optimizations in GPGPUs, 2013, ISCA.
[48] Anoop Gupta, et al. Scheduling and page migration for multiprocessor compute servers, 1994, ASPLOS VI.
[49] Carole-Jean Wu, et al. LATTE-CC: Latency Tolerance Aware Adaptive Cache Compression Management for Energy Efficient GPUs, 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[50] Hideharu Amano, et al. Breadth First Search on Cost-efficient Multi-GPU Systems, 2016, CARN.
[51] Onur Mutlu, et al. The Locality Descriptor: A Holistic Cross-Layer Abstraction to Express Data Locality In GPUs, 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[52] Hyesoon Kim, et al. An integrated GPU power and performance model, 2010, ISCA.
[53] Carole-Jean Wu, et al. Quantifying the energy cost of data movement for emerging smart phone workloads on mobile platforms, 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[54] Thomas R. Gross, et al. Matching memory access patterns and data placement for NUMA systems, 2012, CGO '12.