Flexible software profiling of GPU architectures
David W. Nellans | Daniel R. Johnson | Stephen W. Keckler | Mark Stephenson | Mike O'Connor | Yunsup Lee | Eiman Ebrahimi | Siva Kumar Sastry Hari
[1] Satish Narayanasamy, et al. BugNet: continuously recording program execution for deterministic replay debugging, 2005, 32nd International Symposium on Computer Architecture (ISCA'05).
[2] David Keppel, et al. Shade: a fast instruction-set simulator for execution profiling, 1994, SIGMETRICS.
[3] Onur Mutlu, et al. A case for bufferless routing in on-chip networks, 2009, ISCA '09.
[4] Nicholas Nethercote, et al. Valgrind: a framework for heavyweight dynamic binary instrumentation, 2007, PLDI '07.
[5] Tao Li, et al. Informed Microarchitecture Design Space Exploration Using Workload Dynamics, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[6] Karsten Schwan, et al. A framework for dynamically instrumenting GPU compute applications within GPU Ocelot, 2011, GPGPU-4.
[7] Joel Emer, et al. SASSIFI: Evaluating Resilience of GPU Applications, 2015.
[8] John L. Hennessy, et al. Multiprocessor Simulation and Tracing Using Tango, 1991, ICPP.
[9] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[10] Onur Mutlu, et al. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[11] Eric A. Brewer, et al. PROTEUS: a high-performance parallel-architecture simulator, 1992, SIGMETRICS '92/PERFORMANCE '92.
[12] Wen-mei W. Hwu, et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing, 2012.
[13] Aamer Jaleel, et al. High performance cache replacement using re-reference interval prediction (RRIP), 2010, ISCA.
[14] Michael Garland, et al. Efficient Sparse Matrix-Vector Multiplication on CUDA, 2008.
[15] Yi Yang, et al. Warp-level divergence in GPUs: Characterization, impact, and mitigation, 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[16] Corina Sas, et al. Exploring the Design Space, 2006.
[17] Keshav Pingali, et al. A quantitative study of irregular programs on GPUs, 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[18] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[19] Margaret Martonosi, et al. Dynamically exploiting narrow width operands to improve processor power and performance, 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[20] Derek Bruening, et al. Efficient, transparent, and comprehensive runtime code manipulation, 2004.
[21] Lixin Zhang, et al. Mambo: a full system simulator for the PowerPC architecture, 2004, PERV.
[22] George Kurian, et al. Graphite: A distributed parallel simulator for multicores, 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.
[23] Bo Fang, et al. GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications, 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[24] Krste Asanovic, et al. Convergence and scalarization for data-parallel architectures, 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[25] J. Robert Jump, et al. The rice parallel processing testbed, 1988, SIGMETRICS '88.
[26] Kevin Skadron, et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance, 2010, ISCA.
[27] Mike O'Connor, et al. Divergence-Aware Warp Scheduling, 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[28] Rajiv Gupta, et al. Bitwidth aware global register allocation, 2003, POPL '03.
[29] Sandia Report, et al. Improving Performance via Mini-applications, 2009.
[30] Krste Asanovic, et al. Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures, 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[31] John E. Stone, et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems, 2010, Computing in Science & Engineering.
[32] Jeffrey Dean, et al. ProfileMe: hardware support for instruction-level profiling on out-of-order processors, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.
[33] John Sartori, et al. Branch and Data Herding: Reducing Control and Memory Divergence for Error-Tolerant GPU Applications, 2012, IEEE Transactions on Multimedia.
[34] Harish Patil, et al. Pin: building customized program analysis tools with dynamic instrumentation, 2005, PLDI '05.