Opportunistic Computing in GPU Architectures
Mahmut T. Kandemir | Chita R. Das | Anand Sivasubramaniam | Asit K. Mishra | Adwait Jog | Ashutosh Pattnaik | Onur Kayiran | Xulong Tang
[1] Indrani Paul,et al. Achieving Exascale Capabilities through Heterogeneous Computing , 2015, IEEE Micro.
[2] Mahmut T. Kandemir,et al. Race-To-Sleep + Content Caching + Display Caching: A Recipe for Energy-efficient Video Streaming on Handhelds , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[3] Wu-chun Feng,et al. MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL , 2016, Parallel Comput..
[4] Alexander Sprintson,et al. GCA: Global congestion awareness for load balance in Networks-on-Chip , 2013, 2013 Seventh IEEE/ACM International Symposium on Networks-on-Chip (NoCS).
[5] Onur Mutlu,et al. Accelerating Dependent Cache Misses with an Enhanced Memory Controller , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[6] Harold S. Stone,et al. A Logic-in-Memory Computer , 1970, IEEE Transactions on Computers.
[7] Nigel P. Topham,et al. Characterizing memory bottlenecks in GPGPU workloads , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[8] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[9] Mahmut T. Kandemir,et al. μC-States: Fine-grained GPU datapath power management , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[10] John Kim,et al. Throughput-Effective On-Chip Networks for Manycore Accelerators , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[11] Gabriel H. Loh,et al. 3D-Stacked Memory Architectures for Multi-core Processors , 2008, 2008 International Symposium on Computer Architecture.
[12] Erik Brunvand,et al. Impulse: building a smarter memory controller , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[13] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[14] Onur Mutlu,et al. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[15] Mahmut T. Kandemir,et al. Data Movement Aware Computation Partitioning , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[16] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[17] A. Gottlieb,et al. The NYU Ultracomputer - Designing a MIMD Shared Memory Parallel Computer , 1983 .
[18] Ramyad Hadidi,et al. CAIRO , 2017, ACM Trans. Archit. Code Optim..
[19] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[20] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.
[21] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[22] Mahmut T. Kandemir,et al. Scheduling techniques for GPU architectures with processing-in-memory capabilities , 2016, 2016 International Conference on Parallel Architecture and Compilation Techniques (PACT).
[23] Jung Ho Ahn,et al. DRAMA: An Architecture for Accelerated Processing Near Memory , 2015, IEEE Computer Architecture Letters.
[24] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[25] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Peter M. Kogge,et al. EXECUBE-A New Architecture for Scaleable MPPs , 1994, 1994 International Conference on Parallel Processing Vol. 1.
[27] Chita R. Das,et al. A case for heterogeneous on-chip interconnects for CMPs , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[28] Mahmut T. Kandemir,et al. Anatomy of GPU Memory System for Multi-Application Execution , 2015, MEMSYS.
[29] Mahmut T. Kandemir,et al. Characterizing diverse handheld apps for customized hardware acceleration , 2017, 2017 IEEE International Symposium on Workload Characterization (IISWC).
[30] Mahmut T. Kandemir,et al. Addressing End-to-End Memory Access Latency in NoC-Based Multicores , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[31] Nan Jiang,et al. Packet chaining: Efficient single-cycle allocation for on-chip networks , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[32] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.
[33] Chita R. Das,et al. OSCAR: Orchestrating STT-RAM cache traffic for heterogeneous CPU-GPU architectures , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[34] Chun Chen,et al. The architecture of the DIVA processing-in-memory chip , 2002, ICS '02.
[35] George Kesidis,et al. Spock: Exploiting Serverless Functions for SLO and Cost Aware Resource Procurement in Public Cloud , 2019, 2019 IEEE 12th International Conference on Cloud Computing (CLOUD).
[36] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[37] Mahmut T. Kandemir,et al. Understanding Energy Efficiency in IoT App Executions , 2019, 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS).
[38] Jinchun Kim,et al. Bandwidth-efficient on-chip interconnect designs for GPGPUs , 2015, 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC).
[39] Xipeng Shen,et al. On-the-fly elimination of dynamic irregularities for GPU computing , 2011, ASPLOS XVI.
[40] Mahmut T. Kandemir,et al. Exploiting Core Criticality for Enhanced GPU Performance , 2016, SIGMETRICS.
[41] Ralph Grishman,et al. The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer , 1983, IEEE Transactions on Computers.
[42] Rachata Ausavarungnirun,et al. Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks , 2018, ASPLOS.
[43] Maya Gokhale,et al. Processing in Memory: The Terasys Massively Parallel PIM Array , 1995, Computer.
[44] Kyung Hoon Kim,et al. Packet coalescing exploiting data redundancy in GPGPU architectures , 2017, ICS.
[45] Mahmut T. Kandemir,et al. Quantifying Data Locality in Dynamic Parallelism in GPUs , 2018, Proc. ACM Meas. Anal. Comput. Syst..
[46] Mike Ignatowski,et al. TOP-PIM: throughput-oriented programmable processing in memory , 2014, HPDC '14.
[47] David R. Kaeli,et al. Asymmetric NoC Architectures for GPU Systems , 2015, NOCS.
[48] Bill Lin,et al. Destination-based adaptive routing on 2D mesh networks , 2010, 2010 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS).
[49] G.J. Minden,et al. A survey of active network research , 1997, IEEE Communications Magazine.
[50] Aaftab Munshi,et al. The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).
[51] Ramyad Hadidi,et al. GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[52] William J. Dally,et al. Route packets, not wires: on-chip inteconnection networks , 2001, DAC '01.
[53] Christoforos E. Kozyrakis,et al. A case for intelligent RAM , 1997, IEEE Micro.
[54] Chau-Wen Tseng,et al. Compiler optimizations for improving data locality , 1994, ASPLOS VI.
[55] Wu-chun Feng,et al. Measuring and modeling on-chip interconnect power on real hardware , 2016, 2016 IEEE International Symposium on Workload Characterization (IISWC).
[56] Mike O'Connor,et al. Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[57] Sudhakar Yalamanchili,et al. Understanding Energy Aspects of Processing-near-Memory for HPC Workloads , 2015, MEMSYS.
[58] Wu-chun Feng,et al. Automatic Command Queue Scheduling for Task-Parallel Workloads in OpenCL , 2015, 2015 IEEE International Conference on Cluster Computing.
[59] Mahmut T. Kandemir,et al. CritICs Critiquing Criticality in Mobile Apps , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[60] Norman P. Jouppi,et al. CACTI 6.0: A Tool to Model Large Caches , 2009 .
[61] Nigel P. Topham,et al. Evaluating and mitigating bandwidth bottlenecks across the memory hierarchy in GPUs , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[62] Gabriel H. Loh,et al. A Processing-in-Memory Taxonomy and a Case for Studying Fixed-function PIM , 2013 .
[63] Mahmut T. Kandemir,et al. Controlled Kernel Launch for Dynamic Parallelism in GPUs , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[64] Chen Sun,et al. DSENT - A Tool Connecting Emerging Photonics with Electronics for Opto-Electronic Networks-on-Chip Modeling , 2012, 2012 IEEE/ACM Sixth International Symposium on Networks-on-Chip.
[65] R. Govindarajan,et al. Taming warp divergence , 2017, 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[66] William J. Dally,et al. GPUs and the Future of Parallel Computing , 2011, IEEE Micro.
[67] Mahmut T. Kandemir,et al. Phoenix: A Constraint-Aware Scheduler for Heterogeneous Datacenters , 2017, 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS).