A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck

Memory latency has always been a major issue in embedded systems that execute memory-intensive applications, and it becomes more severe as the gap between processor and memory speed continues to grow. Hardware and software prefetching have both been shown to be effective in tolerating the large latencies inherent in large off-chip memories; however, each has its shortcomings. Hardware schemes are more complex and require extra circuitry to compute data access strides, while software schemes generate prefetch instructions, which, if not computed carefully, may hamper performance. On the other hand, some application domains (such as multimedia) have a uniform memory access pattern that is known a priori and that, if exploited, could yield a significant improvement in application performance. With this characteristic in mind, we present our findings on hiding memory latency using the direct memory access (DMA) mode, which is present in all modern systems, combined with a software prefetch mechanism and a customized on-chip memory hierarchy mapping. Unlike previous approaches, we are able to estimate the performance and power metrics without actually implementing the embedded system. Experimental results on nine well-known multimedia and imaging applications demonstrate the efficiency of our technique. Finally, we verify the performance estimations by implementing and simulating the algorithms on the TI C6201 processor.
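
As an illustration only, the idea of combining DMA transfers with software prefetching into on-chip memory can be sketched as a double-buffering loop: while the processor works on one on-chip tile, the next tile is already being fetched. This is a minimal sketch under stated assumptions, not the method evaluated in the paper. The names `dma_fetch`, `sum_double_buffered`, and `TILE_WORDS` are hypothetical, and `memcpy` merely stands in for a DMA engine, so no real overlap of transfer and computation occurs here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE_WORDS 64 /* hypothetical on-chip buffer capacity, in elements */

/* Hypothetical stand-in for issuing a DMA transfer from off-chip to on-chip
 * memory. A real DMA engine would copy in the background; memcpy only models
 * the data movement, not the concurrency. */
static void dma_fetch(int32_t *on_chip, const int32_t *off_chip, size_t n)
{
    memcpy(on_chip, off_chip, n * sizeof *on_chip);
}

/* Sums an array tile by tile using two on-chip buffers: the fetch of the
 * next tile is issued before the current tile is processed. */
int64_t sum_double_buffered(const int32_t *data, size_t n)
{
    static int32_t buf[2][TILE_WORDS]; /* models two on-chip buffers */
    int64_t total = 0;
    size_t offset = 0;
    size_t cur_len = n < TILE_WORDS ? n : TILE_WORDS;
    int cur = 0;

    assert(data != NULL || n == 0);
    if (n == 0)
        return 0;

    /* Prologue: fetch the first tile before any computation starts. */
    dma_fetch(buf[cur], data, cur_len);

    while (cur_len > 0) {
        size_t next_off = offset + cur_len;
        size_t remaining = n - next_off;
        size_t next_len = remaining < TILE_WORDS ? remaining : TILE_WORDS;

        /* Prefetch: request the next tile into the idle buffer. */
        if (next_len > 0)
            dma_fetch(buf[cur ^ 1], data + next_off, next_len);

        /* Compute on the tile that is already on chip. */
        for (size_t i = 0; i < cur_len; i++)
            total += buf[cur][i];

        /* Swap buffers and advance to the prefetched tile. */
        offset = next_off;
        cur_len = next_len;
        cur ^= 1;
    }
    return total;
}
```

Because the access pattern is known a priori, the address and size of every tile can be computed ahead of time, which is what lets the fetch for tile k+1 be issued before tile k is processed. On real hardware the compute loop would additionally have to wait for the transfer to complete before swapping buffers.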
