Publisher Summary
This chapter covers two difficult problems frequently encountered by graphics processing unit (GPU) developers: optimizing memory access for kernels with complex, input-dependent access patterns, and mapping computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both pose a formidable challenge because they require dynamic adaptation and tuning of execution policies to achieve high performance across a wide range of inputs; failing to do so incurs a substantial performance penalty. The chapter describes a methodology for solving the memory optimization problem via software-managed caching, which efficiently exploits the fast scratchpad memory. This technique outperforms the cache-less and texture memory-based approaches on pre-Fermi GPU architectures, as well as an approach that relies on the Fermi hardware cache alone. The chapter then presents an algorithm for minimizing the total running time of a complete application comprising multiple interdependent kernels. Either a GPU or a CPU can execute each kernel, but their performance varies greatly across inputs, calling for dynamic assignment of computations to a GPU or a CPU at runtime. Because of the communication overhead caused by data dependencies between kernels, greedy per-kernel selection of the best-performing device is suboptimal. The algorithm instead optimizes the runtime of the complete application by evaluating all assignments jointly, including the overhead of data transfers between devices.
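As a hedged illustration of the first technique, the sketch below shows one common form of software-managed caching on a GPU: each thread block stages a window of a read-only table into shared (scratchpad) memory and serves input-dependent lookups from it, falling back to global memory on a miss. All names (gather_cached, TILE, window_base), the fixed-window policy, and the sample data are illustrative assumptions, not the chapter's actual implementation.

```cuda
// Minimal sketch of software-managed caching in scratchpad (shared) memory.
// Assumption: a single contiguous window of the table is cached per block;
// a real implementation would pick the window from the observed access pattern.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 1024;   // cached window size; must fit in shared memory

__global__ void gather_cached(const float* __restrict__ table,
                              const int*   __restrict__ idx,
                              float* out, int n, int window_base)
{
    __shared__ float cache[TILE];

    // Cooperative staging: threads of the block load the window once.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        cache[i] = table[window_base + i];
    __syncthreads();

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= n) return;

    int j     = idx[g];            // input-dependent access
    int local = j - window_base;
    // Hit: serve from scratchpad; miss: fall back to global memory.
    out[g] = (local >= 0 && local < TILE) ? cache[local] : table[j];
}

int main()
{
    const int N = 1 << 16, TAB = 1 << 14;
    std::vector<float> h_table(TAB);
    std::vector<int>   h_idx(N);
    std::vector<float> h_out(N);
    for (int i = 0; i < TAB; ++i) h_table[i] = float(i);
    for (int i = 0; i < N;   ++i) h_idx[i]   = (i * 7) % TILE;  // mostly hits

    float *d_table, *d_out; int *d_idx;
    cudaMalloc(&d_table, TAB * sizeof(float));
    cudaMalloc(&d_idx,   N   * sizeof(int));
    cudaMalloc(&d_out,   N   * sizeof(float));
    cudaMemcpy(d_table, h_table.data(), TAB * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_idx,   h_idx.data(),   N   * sizeof(int),   cudaMemcpyHostToDevice);

    gather_cached<<<(N + 255) / 256, 256>>>(d_table, d_idx, d_out, N, 0);
    cudaMemcpy(h_out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("out[42] = %.1f\n", h_out[42]);

    cudaFree(d_table); cudaFree(d_idx); cudaFree(d_out);
    return 0;
}
```

The design point this sketch captures is that the cache policy lives in software: the kernel, not the hardware, decides what to stage and when, which is what lets the policy adapt to the input.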
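For the second problem, the following host-side sketch illustrates why joint evaluation of assignments beats greedy per-kernel device selection. It is a small dynamic program over a linear chain of kernels under assumed, made-up cost and transfer numbers; the chapter's algorithm addresses general interdependent kernels, so this is only a simplified analogue of the idea.

```cuda
// Hypothetical sketch: assign each kernel in a chain to CPU or GPU so that
// total time (compute plus inter-device transfers) is minimal. A greedy
// per-kernel choice ignores transfer overhead; this DP evaluates all
// assignments jointly. All cost numbers are illustrative.
#include <algorithm>
#include <cstdio>

enum Device { CPU = 0, GPU = 1 };

int main()
{
    const int K = 4;                          // kernels in the chain
    // cost[k][d]: predicted runtime of kernel k on device d (ms)
    double cost[K][2] = {{5, 2}, {1, 4}, {6, 2}, {2, 3}};
    // xfer[k]: time to move kernel k's input between devices (ms)
    double xfer[K]    = {0, 3, 3, 3};

    // best[d]: minimal time to finish kernels 0..k with kernel k on device d
    double best[2] = {cost[0][CPU], cost[0][GPU]};
    int choice[K][2] = {};                    // predecessor device, for backtracking

    for (int k = 1; k < K; ++k) {
        double next[2];
        for (int d = 0; d < 2; ++d) {
            double stay = best[d];            // same device: no transfer
            double move = best[1 - d] + xfer[k];
            choice[k][d] = (stay <= move) ? d : 1 - d;
            next[d] = cost[k][d] + std::min(stay, move);
        }
        best[0] = next[0]; best[1] = next[1];
    }

    // Backtrack the jointly optimal assignment.
    int d = (best[CPU] <= best[GPU]) ? CPU : GPU;
    printf("total time: %.1f ms\n", std::min(best[CPU], best[GPU]));
    int assign[K];
    for (int k = K - 1; k >= 0; --k) { assign[k] = d; d = choice[k][d]; }
    for (int k = 0; k < K; ++k)
        printf("kernel %d -> %s\n", k, assign[k] == GPU ? "GPU" : "CPU");
    return 0;
}
```

With these sample numbers, picking the locally fastest device for each kernel pays three transfer penalties and finishes later than the joint plan the DP returns, which keeps the whole chain on one device once transfers are accounted for.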