Gullfoss: Accelerating and Simplifying Data Movement among Heterogeneous Computing and Storage Resources

High-end computer systems increasingly rely on heterogeneous computing resources. For instance, a datacenter server might include multiple CPUs, high-end GPUs, PCIe SSDs, and high-speed network interface cards. All of these components supply computing resources and operate at high bandwidth. Coordinating data movement and scheduling computation across them is complex, because current programming models require system developers to schedule data transfers explicitly. Moving data is also inefficient in both performance and energy: some applications running on GPU-equipped systems spend over 55% of their execution time and 53% of their energy moving data between the storage device and the GPU. This paper proposes Gullfoss, a system that provides a simplified programming model for these heterogeneous computing systems. Gullfoss offers a high-level interface for specifying an application’s data movement requirements and dynamically schedules data transfers, accounting for current system load and program requirements. Our initial implementation of Gullfoss focuses on transfers between an SSD and a GPU, eliminating the wasteful trip through main memory as data moves between the two. This saves memory energy and bandwidth, leaving the CPU free to do useful work or to run at a lower frequency for better energy efficiency. We implement and evaluate Gullfoss using commercially available hardware components. Compared with a system without Gullfoss, it achieves a 1.46× speedup, reduces energy consumption by 28%, and improves the energy-delay product by 41%. For multi-program workloads, Gullfoss delivers a 1.5× speedup, and it also improves the performance of a GPU-based MapReduce framework by 10%.
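The abstract does not spell out Gullfoss’s programming interface, but the overhead it targets is easy to make concrete. The sketch below is a minimal, hypothetical illustration using standard POSIX and CUDA runtime calls (error handling trimmed); it shows the conventional SSD-to-GPU path, in which the same data crosses main memory twice. Gullfoss’s peer-to-peer transfers are intended to replace both copies with a single DMA over PCIe.

```c
/*
 * Baseline path that Gullfoss is designed to eliminate: the file is read
 * into a host (main-memory) buffer first, then copied again to the GPU.
 * Plain POSIX + CUDA runtime calls; error handling is trimmed for brevity.
 */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>

int load_file_to_gpu(const char *path, void **dev_buf, size_t size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    void *host_buf;
    cudaMallocHost(&host_buf, size);        /* pinned staging buffer in DRAM */
    ssize_t n = read(fd, host_buf, size);   /* copy 1: SSD -> main memory    */
    close(fd);
    if (n != (ssize_t)size) {
        cudaFreeHost(host_buf);
        return -1;
    }

    cudaMalloc(dev_buf, size);              /* destination in GPU memory     */
    cudaMemcpy(*dev_buf, host_buf, size,    /* copy 2: main memory -> GPU    */
               cudaMemcpyHostToDevice);
    cudaFreeHost(host_buf);
    return 0;
}
```

Under Gullfoss, the staging buffer and the second copy disappear: the application declares its data movement requirement once, and the data travels directly from the SSD to GPU memory over PCIe, freeing main-memory bandwidth and CPU time. The actual names and signatures of Gullfoss’s interface are not given in this abstract, so none are shown here.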
