An adaptive cut-off for task parallelism

In task-parallel languages, an important factor in achieving good performance is the use of a cut-off technique to reduce the number of tasks created. Using a cut-off to avoid creating an excessive number of tasks helps the runtime system reduce the total overhead associated with task creation, particularly when the tasks are fine-grained. Unfortunately, the best cut-off technique usually depends on the structure of the application or even on its input data. We propose a new cut-off technique that uses information collected from the application at runtime to decide which tasks should be pruned to improve performance. This technique does not rely on the programmer to determine which cut-off technique is best suited for the application. We have implemented this cut-off in the context of the new OpenMP tasking model. Our evaluation, with a variety of applications, shows that our adaptive cut-off makes good decisions and, most of the time, matches the optimal cut-off that a programmer could set by hand.