Performance Debugging of GPGPU Applications with the Divergence Map
[1] William Gropp, et al. An adaptive performance modeling tool for GPU architectures, 2010, PPoPP '10.
[2] Stijn Eyerman, et al. Per-thread cycle accounting in SMT processors, 2009, ASPLOS.
[3] Sudhakar Yalamanchili, et al. A characterization and analysis of PTX kernels, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[4] Barton P. Miller, et al. The Paradyn Parallel Performance Measurement Tool, 1995, Computer.
[5] Edward T. Grochowski, et al. Larrabee: A many-core x86 architecture for visual computing, 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).
[6] Matthias Hauswirth, et al. Evaluating the accuracy of Java profilers, 2010, PLDI '10.
[7] Kai Li, et al. Performance measurements for multithreaded programs, 1998, SIGMETRICS '98/PERFORMANCE '98.
[8] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[9] Kun Zhou, et al. BSGP: bulk-synchronous GPU programming, 2008, SIGGRAPH 2008.
[10] Nathan R. Tallent, et al. Effective performance measurement and analysis of multithreaded applications, 2009, PPoPP '09.
[11] Kenneth E. Batcher, et al. Sorting networks and their applications, 1968, AFIPS Spring Joint Computing Conference.
[12] Wen-mei W. Hwu, et al. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA, 2008, PPoPP.
[13] Mark J. Harris, et al. Parallel Prefix Sum (Scan) with CUDA, 2011.
[14] Tor M. Aamodt, et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[15] Susan L. Graham, et al. gprof: a call graph execution profiler (with retrospective), 1982.
[16] Norman P. Jouppi, et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0, 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[17] Avi Mendelson, et al. Programming model for a heterogeneous x86 platform, 2009, PLDI '09.
[18] Philippas Tsigas, et al. GPU-Quicksort: A practical Quicksort algorithm for graphics processors, 2010, JEAL.
[19] Oscar Naim, et al. Dynamic instrumentation of threaded applications, 1999, PPoPP '99.
[20] Michael Boyer. Automated Dynamic Analysis of CUDA Programs, 2008.
[21] Fredrik Larsson, et al. Simics: A Full System Simulation Platform, 2002, Computer.