Streaming Dynamic Coarse-Grained CPU/GPU Workloads with Heterogeneous Pipelines in FastFlow

Software pipelines decompose a repetitive sequential process into a succession of distinct sub-processes called stages, each of which can execute concurrently on a separate processing element. This paper presents a heterogeneous streaming pipeline implementation of a numerical linear algebra code using the FastFlow skeletal library. By introducing minimal memory management, we implement a large-scale streaming application that allocates the different pipeline stages to multi-core CPU and multi-GPU resources in a cluster environment, demonstrating that the algorithmic skeleton approach is suitable for efficiently coordinating the pipeline operation. Our implementation shows that long-running heterogeneous pipelines can be effectively implemented in FastFlow.
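
To make the pipeline notion above concrete, the following is a minimal sketch of a three-stage streaming pipeline in which each stage runs on its own thread and items flow through blocking queues. It uses only the C++17 standard library and is not FastFlow's API or the paper's implementation; the names `Channel` and `run_pipeline`, the end-of-stream marker, and the toy workload (squaring and summing integers) are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical blocking queue connecting two adjacent stages.
// An empty optional (std::nullopt) marks the end of the stream.
template <typename T>
class Channel {
public:
    void push(std::optional<T> item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(item));
        }
        ready_.notify_one();
    }

    // Blocks until an item (or the end-of-stream marker) is available.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !items_.empty(); });
        std::optional<T> item = std::move(items_.front());
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::optional<T>> items_;
};

// Streams 1..n through three concurrent stages:
// generate -> square -> accumulate. Returns the sum of squares.
long run_pipeline(int n) {
    assert(n >= 0);
    Channel<long> first_to_second;
    Channel<long> second_to_third;
    long sum = 0;

    std::thread source([&] {
        for (int i = 1; i <= n; ++i) first_to_second.push(i);
        first_to_second.push(std::nullopt);  // signal end of stream
    });
    std::thread worker([&] {
        while (auto x = first_to_second.pop()) second_to_third.push(*x * *x);
        second_to_third.push(std::nullopt);  // propagate end of stream
    });
    std::thread sink([&] {
        while (auto y = second_to_third.pop()) sum += *y;
    });

    source.join();
    worker.join();
    sink.join();  // joining makes the sink's writes to sum visible here
    return sum;
}
```

Calling `run_pipeline(4)` streams 1, 2, 3, 4 through the stages and accumulates their squares. In a heterogeneous setting such as the one the abstract describes, the middle stage is where a GPU offload would plausibly sit, but that mapping is not shown here.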
