An optimised dataflow engine for GPGPU stream processing

Stream processing applications have demanding performance requirements that are hard to meet with traditional parallel models on modern many-core architectures such as GPUs. Recent dataflow computing models, in contrast, naturally expose parallelism and make it easier to exploit for a wide class of applications: instead of following program order, operations run in parallel as soon as their input operands become available. This work presents an extension to an existing dataflow library for Java. The extension implements high-level constructs that use multiple command queues to overlap memory operations with kernel executions on GPUs. Experimental results show that significant speedup can be achieved for a subset of well-known stream processing applications: Volume Ray-Casting, Path-Tracing and Sobel Filter. The work also presents new contributions on concurrency analysis and on the stream processing parallel model in dataflow.
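The idea of overlapping memory operations with kernel executions can be illustrated with a minimal sketch in plain Java. It is not the library's API and does not touch a GPU: the class `OverlapSketch`, the methods `transfer`, `kernel` and `process`, and the two single-threaded executors standing in for command queues are all assumptions made for illustration. Each chunk's kernel task fires as soon as its own transfer has completed, so the transfer of the next chunk can proceed while the current chunk is being computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: all names are illustrative, not taken from the library.
class OverlapSketch {
    // Stand-in for a host-to-device copy (assumption: a plain array copy).
    static int[] transfer(int[] chunk) {
        return Arrays.copyOf(chunk, chunk.length);
    }

    // Stand-in for a GPU kernel: sums the squares of one chunk.
    static long kernel(int[] deviceChunk) {
        long sum = 0;
        for (int v : deviceChunk) {
            sum += (long) v * v;
        }
        return sum;
    }

    static long process(List<int[]> chunks) {
        // Two single-threaded executors play the role of two command queues.
        ExecutorService transferQueue = Executors.newSingleThreadExecutor();
        ExecutorService kernelQueue = Executors.newSingleThreadExecutor();
        try {
            List<CompletableFuture<Long>> results = new ArrayList<>();
            for (int[] chunk : chunks) {
                // Dataflow firing rule: the kernel task becomes runnable only
                // once its input operand (the transferred chunk) is available.
                results.add(CompletableFuture
                        .supplyAsync(() -> transfer(chunk), transferQueue)
                        .thenApplyAsync(OverlapSketch::kernel, kernelQueue));
            }
            long total = 0;
            for (CompletableFuture<Long> r : results) {
                total += r.join(); // wait for each chunk's result
            }
            return total;
        } finally {
            transferQueue.shutdown();
            kernelQueue.shutdown();
        }
    }

    public static void main(String[] args) {
        List<int[]> chunks = List.of(new int[]{1, 2}, new int[]{3, 4}, new int[]{5});
        System.out.println(process(chunks)); // prints 55
    }
}
```

With a single queue, every transfer and kernel would be serialised; with two, the only ordering enforced is the data dependency between a chunk's transfer and its own kernel.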
