HaoCL: Harnessing Large-scale Heterogeneous Processors Made Easy

The pervasive adoption of Deep Learning (DL) and Graph Processing (GP) has made large-scale clusters of heterogeneous accelerators, including GPUs and FPGAs, a de facto requirement. The OpenCL programming framework can be used on the individual nodes of such clusters, but it is not designed for distributed deployment. Fortunately, the original OpenCL semantics naturally fit the programming environment of heterogeneous clusters. In this paper, we propose HaoCL, a heterogeneity-aware OpenCL-like programming framework that facilitates programming a wide range of scientific applications, including DL and GP workloads, on large-scale heterogeneous clusters. With HaoCL, existing applications can be deployed directly on heterogeneous clusters without any modification to the original OpenCL source code and without awareness of the underlying hardware topologies and configurations. Our experiments show that HaoCL imposes negligible overhead in a distributed environment and provides near-linear speedups on standard benchmarks when the computation or data size exceeds the capacity of a single node. This demo paper presents the system design and the evaluation.
