An Efficient Data-Distribution Mechanism in a Processor-In-Memory (PIM) Architecture Applied to Motion Estimation

Processor-in-memory (PIM) modules are generally used to increase data-level parallelism (DLP) dramatically and to avoid the limited issue rate of current systems, which persists even with SIMD extensions because data bandwidth and functional units are limited. Our approach divides the PIM module into hundreds of smaller PIMs, each of which executes motion estimation for a group of macroblocks in parallel. The logic in each PIM is also highly pipelined so that even more parallelism can be exploited. The main contribution of this paper is a set of architectural techniques that the PIM module can use to overcome the addressing and data-sharing overhead incurred when these smaller PIMs are used. We apply these techniques to motion estimation, which has been reported to take the majority of the execution time of MPEG encoding and has therefore been widely studied. With our paradigm and techniques, the host processor is relieved of the most computationally demanding and data-intensive portions of the workload, which should yield a significant performance gain. With 512 of the smaller PIMs, we observed a reduction in the number of memory accesses by a factor of up to 2,034 and a performance improvement by a factor of up to 439.
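To make the workload concrete, the following is a minimal software sketch of block-matching motion estimation with the macroblocks split into groups, one group per small PIM. It is an illustration under stated assumptions, not the architecture described in the paper: the abstract does not specify the search algorithm, the matching cost, or the grouping policy, so the full search, the sum-of-absolute-differences (SAD) cost, the contiguous grouping, and all function names here are hypothetical.

```python
# Illustrative sketch only. Full search, SAD cost, contiguous grouping, and all
# names below are assumptions; the abstract does not specify any of them.

def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in the
    current frame and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, r):
    """Return ((dx, dy), cost) of the best match within a +/- r pixel window.
    Candidates that fall outside the reference frame are skipped."""
    h, w = len(ref), len(ref[0])
    best_vec, best_cost = None, None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            inside = (0 <= bx + dx and bx + dx + n <= w and
                      0 <= by + dy and by + dy + n <= h)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_vec, best_cost = (dx, dy), cost
    return best_vec, best_cost


def partition_blocks(blocks, num_units):
    """Split the macroblock list into num_units contiguous groups, standing in
    for the assignment of macroblock groups to the smaller PIMs."""
    groups = [[] for _ in range(num_units)]
    for i, block in enumerate(blocks):
        groups[i * num_units // len(blocks)].append(block)
    return groups


def estimate_frame(cur, ref, n, r, num_units):
    """Compute one motion vector per macroblock. Each group is independent of
    the others, so in hardware the outer loop would run in parallel; here it
    runs sequentially."""
    h, w = len(cur), len(cur[0])
    blocks = [(bx, by) for by in range(0, h - n + 1, n)
                       for bx in range(0, w - n + 1, n)]
    vectors = {}
    for group in partition_blocks(blocks, num_units):
        for bx, by in group:
            vectors[(bx, by)] = full_search(cur, ref, bx, by, n, r)[0]
    return vectors
```

Because the groups share no state in this sketch, the outer loop in `estimate_frame` is the part that would map onto parallel units. Neighbouring macroblocks do read overlapping regions of the reference frame, which is one plausible source of the data-sharing overhead that the abstract mentions.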
