Automatic Resource Scheduling with Latency Hiding for Parallel Stencil Applications on GPGPU Clusters

Overlapping computation and communication is key to accelerating stencil applications on parallel computers, especially on GPU clusters. However, writing such overlapped code is a time-consuming part of stencil application development. To address this problem, we developed a code generation tool that automatically produces a parallel stencil application with latency hiding from its dataflow model. With this tool, users visually construct the workflow of a stencil application in a dataflow programming model. Our dataflow compiler determines a data decomposition policy for each application and generates source code that overlaps the stencil computation with communication over MPI and PCIe. We demonstrate two overlapping models: a CPU-GPU hybrid execution model and a GPU-only model. We evaluate our scheduling with a CFD benchmark computing 19-point 3D stencils, achieving 1.45 TFLOPS in single precision on a cluster of 64 Tesla C1060 GPUs.
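The generated code hides communication latency by updating the interior of each subdomain while halo planes are in flight over PCIe and MPI. The sketch below illustrates this pattern for the GPU-only model in plain CUDA with MPI; it is not the tool's generated output, and the domain sizes, kernel names, and the 1-D decomposition along z are assumptions made for the example (the 19-point stencil body is reduced to a placeholder).

```cuda
// Minimal sketch (assumed names and sizes, not the paper's generated code) of one
// time step that overlaps interior computation with halo exchange.
// One GPU per MPI rank, 1-D decomposition along z, pinned host buffers assumed.
#include <mpi.h>
#include <cuda_runtime.h>

#define NX 256
#define NY 256
#define NZ 256        // local z-extent including halo planes
#define HALO 1        // halo width of the stencil

// Placeholder for the 19-point stencil update of one grid point.
__device__ void update_point(const float *in, float *out, int x, int y, int z)
{
    size_t i = ((size_t)z * NY + y) * NX + x;
    out[i] = in[i];   // identity stand-in for the real stencil formula
}

// Interior planes (z in [2*HALO, NZ-2*HALO)) need no halo data.
__global__ void stencil_interior(const float *in, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = 2 * HALO + blockIdx.z;
    if (x < NX && y < NY) update_point(in, out, x, y, z);
}

// Boundary planes (HALO planes next to each halo) wait for the exchange.
__global__ void stencil_boundary(const float *in, float *out)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = (blockIdx.z < HALO) ? HALO + blockIdx.z
                                : NZ - 2 * HALO + (blockIdx.z - HALO);
    if (x < NX && y < NY) update_point(in, out, x, y, z);
}

void step(float *d_in, float *d_out,
          float *h_send_lo, float *h_send_hi, float *h_recv_lo, float *h_recv_hi,
          int rank_lo, int rank_hi,            // MPI_PROC_NULL at domain edges
          cudaStream_t s_comp, cudaStream_t s_copy)
{
    const int plane = NX * NY * HALO;
    const dim3 block(16, 16, 1);

    // 1. Launch the interior update; it runs while halos cross PCIe and MPI.
    stencil_interior<<<dim3(NX / 16, NY / 16, NZ - 4 * HALO), block, 0, s_comp>>>(d_in, d_out);

    // 2. Stage the outgoing boundary planes to the host on the copy stream.
    cudaMemcpyAsync(h_send_lo, d_in + plane, plane * sizeof(float),
                    cudaMemcpyDeviceToHost, s_copy);
    cudaMemcpyAsync(h_send_hi, d_in + (size_t)(NZ - 2 * HALO) * NX * NY,
                    plane * sizeof(float), cudaMemcpyDeviceToHost, s_copy);
    cudaStreamSynchronize(s_copy);

    // 3. Exchange halos with the z-neighbours; the interior kernel keeps running.
    MPI_Request req[4];
    MPI_Irecv(h_recv_lo, plane, MPI_FLOAT, rank_lo, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Irecv(h_recv_hi, plane, MPI_FLOAT, rank_hi, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Isend(h_send_lo, plane, MPI_FLOAT, rank_lo, 1, MPI_COMM_WORLD, &req[2]);
    MPI_Isend(h_send_hi, plane, MPI_FLOAT, rank_hi, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    // 4. Push the received halos back to the device and finish the boundary planes.
    cudaMemcpyAsync(d_in, h_recv_lo, plane * sizeof(float), cudaMemcpyHostToDevice, s_copy);
    cudaMemcpyAsync(d_in + (size_t)(NZ - HALO) * NX * NY, h_recv_hi,
                    plane * sizeof(float), cudaMemcpyHostToDevice, s_copy);
    cudaStreamSynchronize(s_copy);

    stencil_boundary<<<dim3(NX / 16, NY / 16, 2 * HALO), block, 0, s_comp>>>(d_in, d_out);
    cudaStreamSynchronize(s_comp);
}
```

Splitting the update into interior and boundary kernels is what makes the overlap possible: only the boundary planes depend on the exchanged halos, so the bulk of the computation proceeds concurrently with the PCIe and MPI transfers.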
