GPU Computing Pipeline Inefficiencies and Optimization Opportunities in Heterogeneous CPU-GPU Processors
[1] Snehasish Kumar, et al. Fusion: Design tradeoffs in coherent cache hierarchies for accelerators, 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[2] David A. Wood, et al. Supporting x86-64 address translation for 100s of GPU lanes, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[3] Martin Kraus, et al. Pyramid Methods in GPU-Based Image Processing, 2011.
[4] Kevin Skadron, et al. Pannotia: Understanding irregular GPGPU graph applications, 2013, 2013 IEEE International Symposium on Workload Characterization (IISWC).
[5] Albert G. Greenberg, et al. Data center TCP (DCTCP), 2010, SIGCOMM '10.
[6] David A. Wood, et al. Heterogeneous system coherence for integrated CPU-GPU systems, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[7] Tor M. Aamodt, et al. Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs, 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.
[8] Mark Silberstein, et al. PTask: operating system abstractions to manage GPUs as compute devices, 2011, SOSP.
[9] Sudhakar Yalamanchili, et al. Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[10] Minlan Yu, et al. Profiling Network Performance for Multi-tier Data Center Applications, 2011, NSDI.
[11] Sudhakar Yalamanchili, et al. Characterization and analysis of dynamic parallelism in unstructured GPU applications, 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[12] Somayeh Sardashti, et al. The gem5 simulator, 2011, CARN.
[13] Michael I. Gordon, et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs, 2006, ASPLOS XII.
[14] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[15] Stephen W. Keckler, et al. Page Placement Strategies for GPUs within Heterogeneous Memory Systems, 2015, ASPLOS.
[16] Keshav Pingali, et al. A quantitative study of irregular programs on GPUs, 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[17] Aaftab Munshi, et al. The OpenCL specification, 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).
[18] Abhishek Bhattacharjee, et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces, 2014, ASPLOS.
[19] William Thies, et al. StreamIt: A Language for Streaming Applications, 2002, CC.
[20] John Sell, et al. The Xbox One System on a Chip and Kinect Sensor, 2014, IEEE Micro.
[21] Franklin F. Kuo, et al. Proceedings of the ACM SIGCOMM conference on Communications architectures & protocols, 1986, Stowe, Vermont, United States, August 5-7, 1986, 1986, SIGCOMM.
[22] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[23] David A. Wood, et al. A comparative analysis of microarchitecture effects on CPU and GPU memory system behavior, 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[24] Yi Yang, et al. A GPGPU compiler for memory optimization and parallelism management, 2010, PLDI '10.
[25] Kim M. Hazelwood, et al. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer, 2011, (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software.
[26] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[27] David A. Wood, et al. gem5-gpu: A Heterogeneous CPU-GPU Simulator, 2015, IEEE Computer Architecture Letters.
[28] Jianbin Fang, et al. A Comprehensive Performance Comparison of CUDA and OpenCL, 2011, 2011 International Conference on Parallel Processing.
[29] Thomas F. Wenisch, et al. Unlocking bandwidth for GPUs in CC-NUMA systems, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[30] Wen-mei W. Hwu, et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, 2012.