Paper information - Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Valar: a benchmark suite to study the dynamic behavior of heterogeneous systems

Heterogeneous systems have grown in popularity within the commercial platform and application developer communities. A growing number of systems incorporate CPUs, Graphics Processing Units (GPUs), and Accelerated Processing Units (APUs), which combine a CPU and a GPU on the same chip. This emerging class of platforms is now being targeted to accelerate applications in which the host processor (typically a CPU) and the compute device (typically a GPU) cooperate on a computation. In this scenario, application performance depends not only on the processing power of the individual heterogeneous processors, but also on efficient interaction and communication between them. To help architects and application developers quantify many of the key aspects of heterogeneous execution, this paper presents a new set of benchmarks called Valar. The Valar benchmarks are applications specifically chosen to study the dynamic behavior of OpenCL applications that benefit from host-device interaction. We describe the general characteristics of our benchmarks, focusing on those that can help characterize heterogeneous applications. For the purposes of this paper we focus on OpenCL as our programming environment, though we envision versions of Valar in additional heterogeneous programming languages. We profile the Valar benchmarks based on their mapping and execution on different heterogeneous systems. Our evaluation examines optimizations for host-device communication and the effects of closely coupled execution of the benchmarks on the multiple OpenCL devices present in heterogeneous systems.

David R. Kaeli | Dana Schaa | Perhaad Mistry | Yash Ukidave
