Use of embedded DRAMs in video and image computing

Abstract: We evaluated the role of embedded dynamic random access memory (eDRAM) in the performance of programmable mediaprocessors, focusing on video/image computing. eDRAM's contribution to total system performance can be assessed by measuring the number of CPU stall cycles caused by memory transactions. We decomposed the CPU stall cycles into three components: row-access latency, latency due to the memory transaction pipeline, and burst transfer time. We used a cycle-accurate cache and eDRAM model to measure system performance when executing selected low-level video/image computing functions on a mediaprocessor core. We simulated various values for the data bus width, page size, and row-access time of the eDRAM, the pipeline delay of a memory transaction, and the data cache line size. While a wider eDRAM data bus does reduce the burst transfer time, the actual reduction in total stall cycles when the width was expanded from 8 to 16 bytes was lower than expected, ranging from 6.2% to 18.9%. Instead, we found that the row-access latency and the memory transaction pipeline delay account for the major portion of the CPU stall cycles. For example, with a 32-byte-wide data bus, they account for 85.3–95.1% of the memory busy time during which data cache misses are serviced. We show how to lower the CPU stall time further, e.g., by using a no-write-allocate data cache to reduce the total burst transfer time, efficient memory banking to reduce the number of eDRAM page misses, and various software/hardware methods to bring data into the cache before the CPU needs them. In particular, the regular memory access pattern of video/image computing allows several methods to enhance memory performance with eDRAM, e.g., enlarging the cache line size and data prefetching. This paper presents our methodology, experimental results, and findings, which should be useful in the design of future highly integrated systems on a chip with eDRAM.
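The three-component decomposition described in the abstract can be illustrated with a small back-of-the-envelope model. This is only a sketch, not the cycle-accurate simulator used in the paper: the function name `miss_cost`, the `MissCost` record, and every cycle count below are hypothetical values chosen for illustration. It shows why doubling the bus width shrinks only the burst component and leaves the two latency terms untouched.

```python
from dataclasses import dataclass


@dataclass
class MissCost:
    """Stall cycles for one data cache miss, split into three components."""
    row_access: int  # latency to open the eDRAM row (zero on a page hit)
    pipeline: int    # delay through the memory transaction pipeline
    burst: int       # cycles spent transferring the cache line over the bus

    @property
    def total(self) -> int:
        return self.row_access + self.pipeline + self.burst


def miss_cost(line_bytes: int, bus_bytes: int, row_access_cycles: int,
              pipeline_cycles: int, page_hit: bool = False) -> MissCost:
    """Hypothetical first-order model: one bus transfer per cycle."""
    burst = -(-line_bytes // bus_bytes)  # ceiling division
    row = 0 if page_hit else row_access_cycles
    return MissCost(row_access=row, pipeline=pipeline_cycles, burst=burst)


# Assumed parameters (not taken from the paper): 32-byte cache line,
# 6-cycle row access, 4-cycle pipeline delay, eDRAM page miss.
narrow = miss_cost(line_bytes=32, bus_bytes=8, row_access_cycles=6, pipeline_cycles=4)
wide = miss_cost(line_bytes=32, bus_bytes=16, row_access_cycles=6, pipeline_cycles=4)

# Fractional reduction in stall cycles from doubling the bus width.
reduction = 1 - wide.total / narrow.total
# Share of the wide-bus miss time spent on the two latency components.
latency_share = (wide.row_access + wide.pipeline) / wide.total

print(f"stall cycles: {narrow.total} -> {wide.total}")
print(f"reduction: {reduction:.1%}, latency share (wide bus): {latency_share:.1%}")
```

In this toy model the burst time halves, but the total falls by much less because the row-access and pipeline terms do not depend on the bus width. Under the same assumptions, passing `page_hit=True` removes the row-access term, which is the effect that memory banking aims for by reducing page misses.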
