Fine-Grained Synchronizations and Dataflow Programming on GPUs
[1] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[2] Jie Cheng, et al. CUDA by Example: An Introduction to General-Purpose GPU Programming, 2010, Scalable Comput. Pract. Exp.
[3] Henry G. Dietz, et al. Dynamic Barrier Architecture for Multi-Mode Fine-Grain Parallelism Using Conventional Processors, 1994, 1994 International Conference on Parallel Processing Vol. 1.
[4] Thomas E. Anderson, et al. The performance implications of thread management alternatives for shared-memory multiprocessors, 1989, SIGMETRICS '89.
[5] Keshav Pingali, et al. Atomic-free irregular computations on GPUs, 2013, GPGPU@ASPLOS.
[6] David A. Padua, et al. Compiler Algorithms for Synchronization, 1987, IEEE Transactions on Computers.
[7] Wu-chun Feng, et al. Accelerating Data-Serial Applications on Data-Parallel GPGPUs: A Systems Approach, 2008.
[8] Leslie Lamport, et al. The parallel execution of DO loops, 1974, CACM.
[9] Edson Cáceres, et al. A Parallel Wavefront Algorithm for Efficient Biological Sequence Comparison, 2003, ICCSA.
[10] Stephen A. Jarvis, et al. Parallelising wavefront applications on general-purpose GPU devices, 2010.
[11] Brucek Khailany, et al. CudaDMA: Optimizing GPU memory bandwidth via warp specialization, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[12] Jack J. Purdum, et al. C programming guide, 1983.
[13] Guang R. Gao, et al. Synchronization state buffer: supporting efficient fine-grain synchronization on many-core architectures, 2007, ISCA '07.
[14] John D. Owens, et al. Efficient Synchronization Primitives for GPUs, 2011, ArXiv.
[15] John R. Nickolls, et al. Lock mechanism that enables atomic updates to shared memory, 2009.
[16] Jun Kong, et al. Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines, 2013, Parallel Comput.
[17] Gadi Taubenfeld. Synchronization Algorithms and Concurrent Programming, 2006.
[18] Meng-Lai Yin, et al. A parallel implementation of the Smith-Waterman algorithm for massive sequences searching, 2004, The 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society.
[19] Pen-Chung Yew, et al. The impact of synchronization and granularity on parallel systems, 1990, ISCA '90.
[20] Alexandru Nicolau, et al. Techniques for efficient placement of synchronization primitives, 2009, PPoPP '09.
[21] Dean M. Tullsen, et al. Supporting fine-grained synchronization on a simultaneous multithreading processor, 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[22] Alexander Aiken, et al. Singe: leveraging warp specialization for high performance on GPUs, 2014, PPoPP '14.
[23] Feng Ji, et al. Using Shared Memory to Accelerate MapReduce on Graphics Processing Units, 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[24] John D. Owens, et al. GPU Computing, 2008, Proceedings of the IEEE.
[25] Wen-mei W. Hwu, et al. GPU Computing Gems Emerald Edition, 2011.
[26] Wu-chun Feng, et al. Inter-block GPU communication via fast barrier synchronization, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[27] Thomas E. Anderson, et al. The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors, 1990, IEEE Trans. Parallel Distributed Syst.
[28] José Ignacio Benavides Benítez, et al. Performance Modeling of Atomic Additions on GPU Scratchpad Memory, 2013, IEEE Transactions on Parallel and Distributed Systems.