MultiMaKe: Chip-multiprocessor driven memory-aware kernel pipelining

The increasing demand for low-power, high-performance multimedia embedded systems has motivated the need for effective solutions that satisfy application bandwidth and latency requirements under a tight power budget. As technology scales, applications must be optimized to take full advantage of the underlying resources and meet both power and performance requirements. We propose MultiMaKe, an application mapping design flow capable of discovering and enabling parallelism opportunities via code transformations, efficiently distributing the computational load across resources, and minimizing unnecessary data transfers. Our approach decomposes the application's tasks into smaller units of computation called kernels, which are distributed and pipelined across the different processing resources. We exploit three ideas: inter-kernel data reuse to minimize unnecessary data transfers between kernels, early execution edges to drive performance, and kernel pipelining to increase system throughput. Our experimental results on JPEG and JPEG2000 show up to 97% off-chip memory access reduction and up to 80% execution time reduction over standard mapping and task-level pipelining approaches.
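To give a feel for why pipelining kernels across processing resources can raise throughput, the following minimal sketch compares running a chain of kernels back to back on one core against running each kernel on its own core in a pipeline. It is not the MultiMaKe algorithm: the function names, the per-kernel costs, and the block count are hypothetical, and the model assumes one kernel per core, fixed per-block costs, and free communication between kernels.

```python
def sequential_time(kernel_costs, num_blocks):
    """Total time when one core runs every kernel on every data block in turn."""
    return num_blocks * sum(kernel_costs)


def pipelined_time(kernel_costs, num_blocks):
    """Total time when each kernel runs on its own core (illustrative model).

    A block can enter kernel k only after it has left kernel k-1 and after
    kernel k has finished the previous block.
    """
    # finish[k] holds the time at which kernel k completed its latest block.
    finish = [0] * len(kernel_costs)
    for _ in range(num_blocks):
        previous_stage_done = 0
        for k, cost in enumerate(kernel_costs):
            start = max(previous_stage_done, finish[k])
            finish[k] = start + cost
            previous_stage_done = finish[k]
    return finish[-1]


if __name__ == "__main__":
    costs = [4, 6, 2]   # hypothetical per-block cost of three kernels
    blocks = 8          # hypothetical number of data blocks
    print("sequential:", sequential_time(costs, blocks))
    print("pipelined: ", pipelined_time(costs, blocks))
```

In this simplified model the pipelined time is bounded by the slowest kernel, which is one reason distributing the computational load evenly across resources matters alongside pipelining.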
