Automatic Resource Scheduling with Latency Hiding for Parallel Stencil Applications on GPGPU Clusters

Overlapping computation and communication is key to accelerating stencil applications on parallel computers, especially on GPU clusters. However, writing such overlapped code is a time-consuming part of stencil application development. To address this problem, we developed a code generation tool that automatically produces a parallel stencil application with latency hiding from its dataflow model. With this tool, users visually construct the workflow of a stencil application in a dataflow programming model. Our dataflow compiler determines a data decomposition policy for each application and generates source code that overlaps the stencil computation with communication over MPI and PCIe. We demonstrate two overlapping models: a CPU-GPU hybrid execution model and a GPU-only model. We evaluate our scheduling with a CFD benchmark computing 19-point 3D stencils, achieving 1.45 TFLOPS in single precision on a cluster of 64 Tesla C1060 GPUs.
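The generated code hides communication latency by updating the interior of each subdomain while halo planes are in flight over PCIe and MPI. The sketch below illustrates this pattern for the GPU-only model in plain CUDA with MPI; it is not the tool's generated output, and the domain sizes, kernel names, and the 1-D decomposition along z are assumptions made for the example (the 19-point stencil body is reduced to a placeholder).

```cuda
// Minimal sketch (assumed names and sizes, not the paper's generated code) of one
// time step that overlaps interior computation with halo exchange.
// One GPU per MPI rank, 1-D decomposition along z, pinned host buffers assumed.
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 256
#define NY 256
#define NZ 256        // local z-extent including halo planes
#define HALO 1        // halo width of the stencil

// Placeholder for the 19-point stencil update of one grid point.
__device__ void update_point(const float *in, float *out, int x, int y, int z)
{
    size_t i = ((size_t)z * NY + y) * NX + x;
    out[i] = in[i];   // identity stand-in for the real stencil formula
}

// Interior planes (z in [2*HALO, NZ-2*HALO)) need no halo data.
__global__ void stencil_interior(const float *in, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = 2 * HALO + blockIdx.z;
    if (x < NX && y < NY) update_point(in, out, x, y, z);
}

// Boundary planes (HALO planes next to each halo) wait for the exchange.
__global__ void stencil_boundary(const float *in, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = (blockIdx.z < HALO) ? HALO + blockIdx.z
                                : NZ - 2 * HALO + (blockIdx.z - HALO);
    if (x < NX && y < NY) update_point(in, out, x, y, z);
}

void step(float *d_in, float *d_out,
          float *h_send_lo, float *h_send_hi, float *h_recv_lo, float *h_recv_hi,
          int rank_lo, int rank_hi,            // MPI_PROC_NULL at domain edges
          cudaStream_t s_comp, cudaStream_t s_copy)
{
    const int plane = NX * NY * HALO;
    const dim3 block(16, 16, 1);

    // 1. Launch the interior update; it runs while halos cross PCIe and MPI.
    stencil_interior<<<dim3(NX / 16, NY / 16, NZ - 4 * HALO), block, 0, s_comp>>>(d_in, d_out);

    // 2. Stage the outgoing boundary planes to the host on the copy stream.
    cudaMemcpyAsync(h_send_lo, d_in + plane, plane * sizeof(float),
                    cudaMemcpyDeviceToHost, s_copy);
    cudaMemcpyAsync(h_send_hi, d_in + (size_t)(NZ - 2 * HALO) * NX * NY,
                    plane * sizeof(float), cudaMemcpyDeviceToHost, s_copy);
    cudaStreamSynchronize(s_copy);

    // 3. Exchange halos with the z-neighbours; the interior kernel keeps running.
    MPI_Request req[4];
    MPI_Irecv(h_recv_lo, plane, MPI_FLOAT, rank_lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(h_recv_hi, plane, MPI_FLOAT, rank_hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(h_send_lo, plane, MPI_FLOAT, rank_lo, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(h_send_hi, plane, MPI_FLOAT, rank_hi, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // 4. Push the received halos back to the device and finish the boundary planes.
    cudaMemcpyAsync(d_in, h_recv_lo, plane * sizeof(float), cudaMemcpyHostToDevice, s_copy);
    cudaMemcpyAsync(d_in + (size_t)(NZ - HALO) * NX * NY, h_recv_hi,
                    plane * sizeof(float), cudaMemcpyHostToDevice, s_copy);
    cudaStreamSynchronize(s_copy);

    stencil_boundary<<<dim3(NX / 16, NY / 16, 2 * HALO), block, 0, s_comp>>>(d_in, d_out);
    cudaStreamSynchronize(s_comp);
}
```

Splitting the update into interior and boundary kernels is what makes the overlap possible: only the boundary planes depend on the exchanged halos, so the bulk of the computation proceeds concurrently with the PCIe and MPI transfers.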
