Automatic dataflow application tuning for heterogeneous systems

The growing prevalence of multicore microprocessors and accelerator technologies in modern supercomputer design calls for new techniques for designing scientific applications that efficiently exploit the power of these systems. The dataflow programming paradigm is better suited to application design for distributed and heterogeneous systems than other techniques. Traditionally in dataflow middleware, application data domains are statically partitioned and distributed among the processors using a demand-driven algorithm. Unfortunately, this task scheduling technique can cause severe load imbalances in heterogeneous environments. Furthermore, when different types of processors are present, the optimal data size can differ for each processor type. To solve the load imbalance problem and to exploit these per-processor differences in optimal data size within a dataflow framework, we present an algorithm that automatically partitions the application workspace. By placing this partitioning under the control of the dataflow runtime system, we can adaptively change the size of data buffers and balance the load. Experiments with four applications show that our technique allows developers to skip the tedious and error-prone step of manually tuning the data granularity. Our technique is always competitive with the best-known data partitioning for these experiments, and can beat it under certain constraints.
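The load imbalance described above can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the algorithm presented in the paper: the worker names, processing rates, per-buffer overheads, and chunk sizes are all hypothetical values chosen for illustration, and the per-type chunk sizes are picked by hand rather than tuned automatically. It compares demand-driven scheduling with one fixed buffer size against the same scheduler with a different buffer size for each processor type.

```python
import heapq

def simulate(total_items, workers, chunk_size_for):
    """Demand-driven scheduling: whichever worker becomes idle first
    requests the next chunk of the remaining work.

    workers: list of (name, items_per_second, per_chunk_overhead_seconds)
    chunk_size_for: function mapping a worker name to its chunk size
    Returns the makespan (time at which the last chunk finishes).
    """
    # Min-heap of (time_when_idle, worker_index).
    idle = [(0.0, i) for i in range(len(workers))]
    heapq.heapify(idle)
    remaining = total_items
    makespan = 0.0
    while remaining > 0:
        free_at, i = heapq.heappop(idle)
        name, rate, overhead = workers[i]
        chunk = min(chunk_size_for(name), remaining)
        remaining -= chunk
        done = free_at + overhead + chunk / rate
        makespan = max(makespan, done)
        heapq.heappush(idle, (done, i))
    return makespan

# Hypothetical platform: a slow CPU with no per-chunk overhead and a fast
# accelerator that pays a fixed cost (e.g. a data transfer) per chunk.
workers = [("cpu", 10.0, 0.0), ("gpu", 100.0, 1.0)]
total = 1000

# Static partitioning: every processor receives the same chunk size.
fixed = {size: simulate(total, workers, lambda _name, s=size: s)
         for size in (10, 200, 500)}

# Per-type partitioning: small chunks for the CPU, large ones for the GPU.
# These sizes are hand-picked assumptions, not automatically tuned values.
per_type = {"cpu": 10, "gpu": 200}
adaptive = simulate(total, workers, lambda name: per_type[name])

for size, t in fixed.items():
    print(f"fixed chunk {size:>3}: makespan {t:.1f} s")
print(f"per-type chunks : makespan {adaptive:.1f} s")
```

In this toy setting, a small fixed chunk makes the accelerator pay its overhead too often, while a large fixed chunk leaves the slow processor holding the last piece of work. A runtime system that chooses buffer sizes per processor type avoids both effects, which is the motivation for moving the partitioning decision out of the developer's hands.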
