Effective sampling-driven performance tools for GPU-accelerated supercomputers
[1] Wen-mei W. Hwu, et al. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors, 2012, PPoPP '12.
[2] S. D. Hammond, et al. Performance Analysis of a Hybrid MPI/CUDA Implementation of the NAS-LU Benchmark, 2010.
[3] Peng Wang, et al. Implementing molecular dynamics on hybrid high performance computers - short range forces, 2011, Comput. Phys. Commun.
[4] Nathan R. Tallent, et al. Analyzing lock contention in multithreaded applications, 2010, PPoPP '10.
[5] Hyesoon Kim, et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping, 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[6] Guido Juckeland, et al. Performance analysis of multi‐level parallelism: inter‐node, intra‐node and hardware accelerators, 2012, Concurr. Comput. Pract. Exp.
[7] James R. Larus, et al. Exploiting hardware performance counters with flow and context sensitive profiling, 1997, PLDI '97.
[8] Nathan R. Tallent, et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs, 2010, Concurr. Comput. Pract. Exp.
[9] Steve Plimpton, et al. Fast parallel algorithms for short-range molecular dynamics, 1993.
[10] Ian Karlin, et al. LULESH Programming Model and Performance Ports Overview, 2012.
[11] Ian Karlin, et al. LULESH 2.0 Updates and Changes, 2013.
[12] Jack J. Dongarra, et al. Enabling and scaling matrix computations on heterogeneous multi-core and multi-GPU systems, 2012, ICS '12.
[13] Nathan R. Tallent, et al. Binary analysis for measurement and attribution of program performance, 2009, PLDI '09.
[14] Allen D. Malony, et al. Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs, 2011, 2011 International Conference on Parallel Processing.
[15] Richard W. Vuduc, et al. A performance analysis framework for identifying potential benefits in GPGPU applications, 2012, PPoPP '12.
[16] Markus Geimer, et al. Identifying the Root Causes of Wait States in Large-Scale Parallel Applications, 2010, 2010 39th International Conference on Parallel Processing.
[17] F. Petrini, et al. The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q, 2003, ACM/IEEE SC 2003 Conference (SC'03).
[18] Stephen A. Jarvis, et al. Performance analysis of a hybrid MPI/CUDA implementation of the NASLU benchmark, 2011, PERV.
[19] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing, 2009.
[20] Nathan R. Tallent, et al. Scalable fine-grained call path tracing, 2011, ICS '11.
[21] Karsten Schwan, et al. Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community, 2011, Computing in Science & Engineering.