HostoSink: A Collaborative Scheduling in Heterogeneous Environment

Driven by limits on power consumption and memory capacity, recent years have seen a strong trend toward heterogeneous environments equipped with accelerators such as GPUs (Graphics Processing Units), FPGAs (Field-Programmable Gate Arrays), and MIC (Many Integrated Core) coprocessors, which help traditional SMP (Symmetric Multi-Processing) CPUs speed up applications. In this paper, we choose the Intel MIC architecture coprocessor as the accelerator and design HostoSink, a runtime system for collaborative scheduling of Pthread tasks. HostoSink uses runtime characteristics of the application and of the heterogeneous environment to schedule Pthread tasks between the CPU and the MIC automatically and dynamically. It thereby gives MIC users an easier way to obtain high performance in a heterogeneous CPU-MIC environment without extensive manual optimization of the original Pthread-based multi-threaded applications. Experimental results show that HostoSink achieves an overall speedup of more than 3x over the original CPU-only performance, and that it also reduces the average amount of data transferred between the CPU and the MIC.
