Gullfoss: Accelerating and Simplifying Data Movement among Heterogeneous Computing and Storage Resources
[1] Kevin Skadron,et al. A performance study of general-purpose applications on graphics processors using CUDA , 2008, J. Parallel Distributed Comput..
[2] Trevor N. Mudge,et al. Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008, 2008 International Symposium on Computer Architecture.
[3] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[4] Steven Swanson,et al. Gordon: using flash memory to build fast, power-efficient clusters for data-intensive applications , 2009, ASPLOS.
[5] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[6] Amar Phanishayee,et al. FAWN: a fast array of wimpy nodes , 2009, SOSP '09.
[7] Satoshi Matsuoka,et al. An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[8] Wu-chun Feng,et al. On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing , 2011, 2011 Symposium on Application Accelerators in High-Performance Computing.
[9] John D. Owens,et al. Multi-GPU MapReduce on GPU Clusters , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[10] Sayantan Sur,et al. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters , 2011, Computer Science - Research and Development.
[11] Kim M. Hazelwood,et al. Where is the data? Why you cannot debate CPU vs. GPU performance without the answer , 2011, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[12] Dhabaleswar K. Panda,et al. Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[13] Nagiza F. Samatova,et al. Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments , 2012, 2012 IEEE International Conference on Cluster Computing.
[14] Andrew S. Grimshaw,et al. Scalable GPU graph traversal , 2012, PPoPP '12.
[15] Alessandro Forin,et al. Direct GPU/FPGA communication Via PCI express , 2012, 2012 41st International Conference on Parallel Processing Workshops.
[16] Karthikeyan Sankaralingam,et al. Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.
[17] Scott B. Baden,et al. Redefining the Role of the CPU in the Era of CPU-GPU Integration , 2012, IEEE Micro.
[18] Michael Bedford Taylor,et al. Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse , 2012, DAC Design Automation Conference 2012.
[19] Yi Yang,et al. CPU-assisted GPGPU on fused CPU-GPU architectures , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[20] Peter Benjamin Volk,et al. GPU join processing revisited , 2012, DaMoN '12.
[21] Frank Hady,et al. When poll is better than interrupt , 2012, FAST.
[22] Bingsheng He,et al. Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture , 2013, Proc. VLDB Endow..
[23] Margaret Martonosi,et al. Reducing GPU offload latency via fine-grained CPU-GPU synchronization , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).
[24] Shinpei Kato,et al. Zero-copy I/O processing for low-latency GPU computing , 2013, 2013 ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS).
[25] Holger Fröning,et al. GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters , 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).
[26] David A. Wood,et al. Heterogeneous system coherence for integrated CPU-GPU systems , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[27] Duncan Poole,et al. Routine Microsecond Molecular Dynamics Simulations with AMBER on GPUs. 2. Explicit Solvent Particle Mesh Ewald , 2013, Journal of chemical theory and computation.
[28] Massimo Bernaschi,et al. GPU Peer-to-Peer Techniques Applied to a Cluster Interconnect , 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum.
[29] Michael Stumm,et al. BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[30] Dhabaleswar K. Panda,et al. GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation , 2014, IEEE Transactions on Parallel and Distributed Systems.
[31] Myoungsoo Jung,et al. GPUdrive: Reconsidering Storage Accesses for GPU Acceleration , 2014.
[32] Jack J. Dongarra,et al. A scalable approach to solving dense linear algebra problems on hybrid CPU‐GPU systems , 2015, Concurr. Comput. Pract. Exp..