A distributed OpenCL framework using redundant computation and data replication

Applications written solely in OpenCL or CUDA cannot execute on a cluster as a whole. Most previous approaches that extend these programming models to clusters share a common idea: they designate a centralized host node and have it coordinate the other nodes for computation. However, the centralized host node becomes a serious performance bottleneck when the number of nodes is large. In this paper, we propose SnuCL-D, a scalable and distributed OpenCL framework for large-scale clusters. SnuCL-D's remote device virtualization gives an OpenCL application the illusion that all compute devices in the cluster reside in a single node. To reduce the amount of control-message and data communication between nodes, SnuCL-D replicates the execution of the OpenCL host program, and its data, on each node. We also propose a new OpenCL host API function and a queueing optimization technique that significantly reduce the overhead incurred by the previous centralized approaches. To show the effectiveness of SnuCL-D, we evaluate it with a microbenchmark and eleven benchmark applications on a large-scale CPU cluster and a medium-scale GPU cluster.
