Inter-kernel data reuse and pipelining on chip-multiprocessors for multimedia applications

The increasing demand for low power and high performance multimedia embedded systems has motivatedation bandwidth and latency requirements under a tight power budge the need for effective solutions to satisfy applict. As technology scales, it is imperative that applications are optimized to take full advantage of the underlying resources and meet both power and performance requirements. We propose a methodology capable of discovering and enabling parallelism opportunities via code transformations, efficiently distributing the computational load across resources, and minimizing unnecessary data transfers. Our approach decomposes the application's tasks into smaller units of computations called kernels, which are distributed and pipelined across the different processing resources. We exploit the ideas of inter-kernel data reuse to minimize unnecessary data transfers between kernels and early execution edges to drive performance. Our experimental results on a JPEG2000 case study show up to 80% performance improvement and 60% dynamic power reduction over standard application mapping approaches.

[1]  Krzysztof Kuchcinski,et al.  A constructive algorithm for memory-aware task assignment and scheduling , 2001, CODES '01.

[2]  Shuvra S. Bhattacharyya,et al.  The pipeline decomposition tree:: an analysis tool for multiprocessor implementation of image processing applications , 2006, Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06).

[3]  Sri Parameswaran,et al.  Design Methodology for Pipelined Heterogeneous Multiprocessor System , 2007, 2007 44th ACM/IEEE Design Automation Conference.

[4]  D. Gajski,et al.  Hardware/software Partitioning And Pipelining , 1997, Proceedings of the 34th Design Automation Conference.

[5]  Kurt Keutzer,et al.  Efficient Parallelization of H.264 Decoding with Macro Block Level Scheduling , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[6]  Nikil D. Dutt,et al.  A framework for memory-aware multimedia application mapping on chip-multiprocessors , 2008, 2008 IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia.

[7]  William Pugh,et al.  The Omega test: A fast and practical integer programming algorithm for dependence analysis , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).

[8]  Mahmut T. Kandemir,et al.  Dynamic management of scratch-pad memory space , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[9]  Peter Marwedel,et al.  Data partitioning for maximal scratchpad usage , 2003, ASP-DAC '03.

[10]  Sri Parameswaran,et al.  Heterogeneous multiprocessor implementations for JPEG:: a case study , 2006, Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS '06).

[11]  C. D. Gelatt,et al.  Optimization by Simulated Annealing , 1983, Science.

[12]  Nikil D. Dutt,et al.  Efficient utilization of scratch-pad memory in embedded processor applications , 1997, Proceedings European Design and Test Conference. ED & TC 97.

[13]  Peter Marwedel,et al.  Scratchpad memory: a design alternative for cache on-chip memory in embedded systems , 2002, Proceedings of the Tenth International Symposium on Hardware/Software Codesign. CODES 2002 (IEEE Cat. No.02TH8627).

[14]  Michael I. Gordon,et al.  Exploiting coarse-grained task, data, and pipeline parallelism in stream programs , 2006, ASPLOS XII.

[15]  Yunheung Paek,et al.  Compiler driven data layout optimization for regular/irregular array access patterns , 2008, LCTES '08.

[16]  B. Ramakrishna Rau,et al.  Iterative modulo scheduling: an algorithm for software pipelining loops , 1994, MICRO 27.

[17]  Erik Brockmeyer,et al.  Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies , 2006, 2006 43rd ACM/IEEE Design Automation Conference.

[18]  Soonhoi Ha,et al.  Pipelined data parallel task mapping/scheduling technique for MPSoC , 2009, 2009 Design, Automation & Test in Europe Conference & Exhibition.

[19]  Kunle Olukotun,et al.  The case for a single-chip multiprocessor , 1996, ASPLOS VII.

[20]  Paul M. Chau,et al.  Macro pipelining based scheduling on high performance heterogeneous multiprocessor systems , 1995, IEEE Trans. Signal Process..

[21]  Michael Wolfe,et al.  More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).

[22]  Jong-Hwan Kim,et al.  Quantum-inspired evolutionary algorithm for a class of combinatorial optimization , 2002, IEEE Trans. Evol. Comput..

[23]  Jingling Xue,et al.  Loop Tiling for Parallelism , 2000, Kluwer International Series in Engineering and Computer Science.

[24]  Scott A. Mahlke,et al.  Uncovering hidden loop level parallelism in sequential applications , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[25]  Vivek Sarkar,et al.  Partitioning and scheduling parallel programs for execution on multiprocessors , 1987 .

[26]  Vikas Agarwal,et al.  Clock rate versus IPC: the end of the road for conventional microarchitectures , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[27]  Christian Steger,et al.  Rapid exploration of multimedia system-on-chips with automatically generated software performance models , 2008, 2008 IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia.

[28]  Kiyoung Choi,et al.  SoCDAL: System-on-chip design AcceLerator , 2008, TODE.

[29]  Erik Brockmeyer,et al.  Layer assignment techniques for low power in multi-layered memory organisations. , 2003 .

[30]  Erik Brockmeyer,et al.  Data reuse analysis technique for software-controlled memory hierarchies , 2004, Proceedings Design, Automation and Test in Europe Conference and Exhibition.

[31]  Tulika Mitra,et al.  Integrated scratchpad memory optimization and task scheduling for MPSoC architectures , 2006, CASES '06.

[32]  Ranga Vemuri,et al.  RECOD: a retiming heuristic to optimize resource and memory utilization in HW/SW codesigns , 1998, Proceedings of the Sixth International Workshop on Hardware/Software Codesign. (CODES/CASHE'98).

[33]  Nikil D. Dutt,et al.  FORAY-GEN: automatic generation of affine functions for memory optimizations , 2005, Design, Automation and Test in Europe.

[34]  Erik Brockmeyer,et al.  Layer assignment techniques for low energy in multi-layered memory organisations , 2003, 2003 Design, Automation and Test in Europe Conference and Exhibition.