Inter and intra kernel reuse analysis driven pipelining on Chip-Multiprocessors

As the demand for low-power multimedia systems continues to grow, so will the need for low-cost and efficient solutions. Driven by this need, as well as by improvements in IC design technology, Chip-Multiprocessors (CMPs) have emerged as a potential solution. CMPs offer flexibility, low cost, low power, and the ability to handle highly parallel workloads. As CMPs scale, it is up to the designer to take full advantage of their computational resources and to manage their constrained memory resources efficiently. In this paper we propose a methodology that enables designers to fully exploit the target platform's computational resources without increasing power consumption, by maximizing the application's data reuse. Our approach uses code transformations to split the application's tasks into smaller units of computation, or subtasks, called kernels. Each kernel is analyzed for inter- and intra-kernel reuse opportunities in order to minimize unnecessary data transfers between kernels. Our approach also couples the scheduling and pipelining of tasks with their memory allocation. This allows us to obtain memory-aware pipelined schedules that increase throughput and reduce power consumption. Our methodology has shown up to 15% performance improvement as well as 33% power reduction when compared to state-of-the-art techniques.
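The idea of inter-kernel reuse can be illustrated with a minimal sketch. This is not the methodology of the paper: the `Kernel` class, the block names, and the assumption that only the previous kernel's footprint stays in on-chip memory are all hypothetical, chosen only to show how data left behind by one kernel can save off-chip transfers for the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel model: the data blocks it reads and writes."""
    name: str
    reads: frozenset
    writes: frozenset


def inter_kernel_reuse(producer, consumer):
    """Blocks touched by `producer` that `consumer` reads again."""
    return (producer.reads | producer.writes) & consumer.reads


def off_chip_loads(kernels, keep_on_chip):
    """Count block loads from off-chip memory for a kernel sequence.

    Assumption: when keep_on_chip is True, exactly the previous kernel's
    footprint (its reads and writes) is still resident on chip.
    """
    loads = 0
    resident = frozenset()
    for k in kernels:
        needed = k.reads - resident if keep_on_chip else k.reads
        loads += len(needed)
        resident = k.reads | k.writes
    return loads


# Illustrative three-kernel pipeline; block names are made up.
k1 = Kernel("k1", frozenset({"a", "b"}), frozenset({"c"}))
k2 = Kernel("k2", frozenset({"b", "c", "d"}), frozenset({"e"}))
k3 = Kernel("k3", frozenset({"d", "e"}), frozenset({"f"}))
pipeline = [k1, k2, k3]

reused = inter_kernel_reuse(k1, k2)
loads_without_reuse = off_chip_loads(pipeline, keep_on_chip=False)
loads_with_reuse = off_chip_loads(pipeline, keep_on_chip=True)
```

In this toy pipeline, keeping the previous kernel's blocks on chip lowers the number of off-chip loads, which is the kind of saving a reuse-aware schedule tries to capture. A real analysis would also have to account for on-chip memory capacity and intra-kernel reuse, which this sketch ignores.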
