Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment
Jack J. Dongarra | Stanimire Tomov | Piotr Luszczek | Azzam Haidar | Chongxiao Cao | Asim YarKhan | Khairul Kabir