ARTM: A lightweight fork-join framework for many-core embedded systems

Embedded architectures are moving toward multi-core and many-core designs in order to sustain ever-growing computing requirements within complexity and power budgets. Programming many-core architectures requires not only parallel programming skills but also efficient exploitation of fine-grain parallelism at both the architecture and runtime levels. As task granularity is reduced, scheduler reactivity becomes increasingly important in order to keep scheduling overhead to a minimum. This paper presents ARTM, a lightweight fork-join framework for scheduling fine-grain parallel tasks on embedded many-core systems. The asynchronous nature of the fork-join model used in this framework dramatically decreases its scheduling overhead. Experiments conducted in this paper show that the overhead induced by the framework is 33 cycles per scheduled task. We also show that the ARTM framework obtains near-ideal speedup for data-parallel applications and achieves better results than other state-of-the-art parallelization techniques.
