Energy-aware parallelization flow and toolset for C code

Multicore architectures are increasingly used in embedded systems to achieve higher throughput at lower energy consumption. This trend increases the need to convert existing sequential code so that it effectively exploits the resources of these architectures. We present a parallelization flow and toolset for legacy C code that comprises a performance estimation tool, a parallelization tool, and a streaming-oriented parallelization framework. These are part of the EU FP7 PHARAON project, a work in progress that aims to develop a complete set of techniques and tools to guide and assist software development for heterogeneous parallel architectures. We demonstrate the effectiveness of the toolset in two ways: an experiment that measures parallelization quality and time for inexperienced users, and a practical case study that reports the parallelization flow and performance results for a stereo vision application.
