Optimal latency-throughput tradeoffs for data parallel pipelines

This paper addresses the optimal mapping of parallel programs composed of a chain of data parallel tasks onto the processors of a parallel system. The input to this class of programs is a stream of data sets, each of which is processed in order by the chain of tasks. This computation structure, also referred to as a data parallel pipeline, is common in several application domains, including digital signal processing, image processing, and computer vision. The performance parameters of stream processing are latency (the time to process an individual data set) and throughput (the aggregate rate at which data sets are processed). These two criteria are distinct, since multiple data sets can be pipelined or processed in parallel. We present a new algorithm to determine a processor mapping of a chain of tasks that optimizes the latency in the presence of throughput constraints, and discuss optimization of the throughput with latency constraints. The problem formulation uses a general and realistic model of inter-task communication, and addresses the entire problem of mapping, which includes clustering tasks into modules, assignment of processors to modules, and possible replication of modules. The main algorithms are based on dynamic programming, and their execution time complexity is polynomial in the number of processors and tasks. The entire framework is implemented as an automatic mapping tool in the Fx parallelizing compiler for a dialect of High Performance Fortran.
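To make the flavor of such a dynamic program concrete, the following is a minimal sketch, not the paper's actual algorithm: it assumes a toy cost model in which a module covering a contiguous run of tasks achieves perfect speedup on its assigned processors (communication costs and module replication, which the paper does model, are ignored here). The pipeline's throughput is taken to be the reciprocal of the slowest stage time, and latency is the sum of stage times.

```python
import math

def min_latency(work, num_procs, min_throughput):
    """Minimize pipeline latency under a throughput constraint.

    work[i] is the (hypothetical) total work of task i; a module covering
    tasks i..j-1 on p processors is assumed to take sum(work[i:j]) / p time.
    A mapping is feasible only if every stage time <= 1 / min_throughput.
    Returns math.inf if no feasible mapping exists.
    """
    n = len(work)
    prefix = [0.0]
    for w in work:
        prefix.append(prefix[-1] + w)

    def stage_time(i, j, p):
        # Assumed cost model: perfect speedup, no communication overhead.
        return (prefix[j] - prefix[i]) / p

    INF = math.inf
    # best[j][q] = min latency to map tasks 0..j-1 onto q processors
    best = [[INF] * (num_procs + 1) for _ in range(n + 1)]
    for q in range(num_procs + 1):
        best[0][q] = 0.0

    for j in range(1, n + 1):
        for q in range(1, num_procs + 1):
            for i in range(j):              # last module covers tasks i..j-1
                for p in range(1, q + 1):   # processors given to that module
                    t = stage_time(i, j, p)
                    if t <= 1.0 / min_throughput:
                        cand = best[i][q - p] + t
                        if cand < best[j][q]:
                            best[j][q] = cand
    return best[n][num_procs]
```

The table has O(n * P) entries and each transition scans O(n * P) choices, so the sketch runs in time polynomial in the number of tasks and processors, matching the complexity class the abstract claims for the real algorithms. For example, `min_latency([4, 2, 6], 4, 0.25)` clusters all three tasks into one module on 4 processors for a latency of 3.0, since the throughput bound (stage time at most 4.0) permits it.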
