Cache conscious data layout organization for embedded multimedia applications

Cache misses form a major bottleneck for real-time multimedia applications because of the off-chip accesses to main memory that they cause. These accesses incur both a large bandwidth overhead (with the related power consumption) and performance penalties. In this paper, we propose a new technique for organizing data in main memory for data-dominated multimedia applications so as to remove the majority of conflict cache misses. The focus of this paper is on the formal and heuristic algorithms we use to steer the data layout decisions and on the experimental results obtained using a prototype tool. Experiments on real-life demonstrators show that we can remove up to 82% of the conflict misses for applications that have already been aggressively transformed at the source level. At the same time, we reduce the off-chip data accesses by up to 78% and, combined with address optimizations, we also reduce the execution time. Our approach is thus complementary to the more conventional way of reducing misses by reorganizing the execution order.
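To make the notion of a conflict miss concrete, the following minimal sketch simulates a direct-mapped cache for a loop that reads `a[i]` and `b[i]` alternately. It is not the algorithm proposed in the paper, only a toy illustration of why the placement of arrays in main memory matters. The function name `count_misses`, the cache parameters (64 lines of 32 bytes), and the 4-byte element size are all assumptions chosen for the example.

```python
def count_misses(base_a, base_b, n, line_size=32, num_lines=64):
    """Count misses in a direct-mapped cache for the access pattern
    a[0], b[0], a[1], b[1], ... with 4-byte elements.

    base_a and base_b are the (hypothetical) byte addresses at which
    the two arrays start in main memory.
    """
    resident = [None] * num_lines  # memory block currently held by each cache line
    misses = 0
    for i in range(n):
        for base in (base_a, base_b):
            addr = base + 4 * i
            block = addr // line_size   # memory block containing this address
            index = block % num_lines   # cache line this block maps to
            if resident[index] != block:
                misses += 1
                resident[index] = block  # evict whatever was there
    return misses


# Cache capacity is 64 * 32 = 2048 bytes; each array holds 256 * 4 = 1024 bytes.

# Layout 1: b starts exactly one cache size after a, so a[i] and b[i]
# always map to the same cache line and keep evicting each other.
print(count_misses(0, 2048, 256))  # 512: every access misses

# Layout 2: b is placed directly after a, so the two arrays occupy
# disjoint cache lines and only the first touch of each block misses.
print(count_misses(0, 1024, 256))  # 64: compulsory misses only
```

In this toy setting, changing only the base address of `b` removes all conflict misses while leaving the execution order untouched, which is the kind of effect a data layout technique aims for.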
