HARS: A hardware-assisted runtime software for embedded many-core architectures

The current trend in embedded computing is to increase the number of processing resources on a chip. Following this paradigm, cluster-based many-core accelerators with a shared hierarchical memory have emerged. Handling synchronizations on these architectures is critical, since the speed-up of parallel implementations of embedded applications strongly depends on the ability to exploit the largest possible number of cores while limiting task management overhead. This article presents HARS (Hardware-Assisted Runtime Software), the combination of a complete low-overhead runtime software and a flexible hardware accelerator for synchronizations. Experiments on a multicore test chip showed that the hardware accelerator for synchronizations has an area overhead of less than 1% compared to a cluster of the chip, while reducing contention and synchronization latencies (by up to 2.8 times compared to a test-and-set implementation). The runtime software part offers basic features such as memory management, as well as optimized execution engines that allow the easy and efficient extraction of parallelism in applications written with multiple programming models. By using the hardware acceleration together with a very low-overhead software task scheduling technique, we show that HARS outperforms an optimized state-of-the-art task scheduler by 13% for the execution of a parallel application.
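For readers unfamiliar with the software baseline mentioned above, the following is a minimal sketch of a test-and-set spinlock written with C11 atomics. It is not the implementation evaluated in the article and not the HARS API: the names `tas_lock_t`, `tas_try_lock`, `tas_lock`, `tas_unlock` and `tas_self_check` are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical type and function names; illustration only, not the HARS API. */
typedef struct {
    atomic_flag held; /* set while some core owns the lock */
} tas_lock_t;

#define TAS_LOCK_INIT { ATOMIC_FLAG_INIT }

/* Try once: test-and-set returns the previous value, so "false" means
 * the flag was clear and the caller now owns the lock. */
static inline bool tas_try_lock(tas_lock_t *l) {
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the lock is acquired. Every failed attempt is still an
 * atomic read-modify-write on the same shared location. */
static inline void tas_lock(tas_lock_t *l) {
    while (!tas_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release the lock so that another core's test-and-set can succeed. */
static inline void tas_unlock(tas_lock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/* Single-threaded sanity check of the lock semantics. */
static inline void tas_self_check(void) {
    tas_lock_t l = TAS_LOCK_INIT;
    tas_lock(&l);
    assert(!tas_try_lock(&l)); /* already held: a second attempt fails */
    tas_unlock(&l);
    assert(tas_try_lock(&l));  /* free again: the attempt succeeds */
    tas_unlock(&l);
}
```

In a loop like this, every waiting core repeatedly issues atomic accesses to the same memory location, which is one plausible source of the contention that a dedicated synchronization accelerator aims to reduce.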
