Packet coalescing exploiting data redundancy in GPGPU architectures
暂无分享,去创建一个
Kyung Hoon Kim | Rahul Boyapati | Ki Hwan Yum | Yuho Jin | Jiayi Huang | Eun Jung Kim | Kyung Hoon Kim | Eun Jung Kim | Yuho Jin | K. H. Yum | Jiayi Huang | R. Boyapati
[1] William J. Dally,et al. Design tradeoffs for tiled CMP on-chip networks , 2006, ICS '06.
[2] Chen Sun,et al. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.
[3] Kevin Skadron,et al. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).
[4] William J. Dally,et al. Principles and Practices of Interconnection Networks , 2004 .
[5] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] Chung-Ta King,et al. Designing Coalescing Network-on-Chip for Efficient Memory Accesses of GPGPUs , 2014, NPC.
[7] Kevin Kai-Wei Chang,et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[8] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[9] Hyungjun Kim,et al. Recursive partitioning multicast: A bandwidth-efficient routing for Networks-on-Chip , 2009, 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip.
[10] Bingsheng He,et al. Mars: Accelerating MapReduce with Graphics Processors , 2011, IEEE Transactions on Parallel and Distributed Systems.
[11] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[12] Natalie D. Enright Jerger,et al. Supporting efficient collective communication in NoCs , 2012, IEEE International Symposium on High-Performance Comp Architecture.
[13] Kevin Skadron,et al. Performance modeling and automatic ghost zone optimization for iterative stencil loops on GPUs , 2009, ICS.
[14] Onur Mutlu,et al. A case for toggle-aware compression for GPU systems , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[15] Natalie D. Enright Jerger,et al. Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support , 2008, 2008 International Symposium on Computer Architecture.
[16] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[17] Dongdong Li,et al. Inter-Core Locality Aware Memory Scheduling , 2016, IEEE Computer Architecture Letters.
[18] Mahmut T. Kandemir,et al. A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[19] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[20] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[21] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[22] Nam Sung Kim,et al. Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[23] David R. Kaeli,et al. Asymmetric NoC Architectures for GPU Systems , 2015, NOCS.
[24] John Kim,et al. Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[25] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[26] Nan Jiang,et al. A detailed and flexible cycle-accurate Network-on-Chip simulator , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[27] Florence March,et al. 2016 , 2016, Affair of the Heart.
[28] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[29] Scott A. Mahlke,et al. WarpPool: Sharing requests with inter-warp coalescing for throughput processors , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[30] Jinchun Kim,et al. Bandwidth-efficient on-chip interconnect designs for GPGPUs , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).
[31] Natalie D. Enright Jerger,et al. Achieving predictable performance through better memory controller placement in many-core CMPs , 2009, ISCA '09.
[32] Li-Shiuan Peh,et al. Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[33] Gu-Yeon Wei,et al. Process Variation Tolerant 3T1D-Based Cache Architectures , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[34] Pramodita Sharma. 2012 , 2013, Les 25 ans de l’OMC: Une rétrospective en photos.