Performance Evaluation and Verification of MMX-type Instructions on an Embedded Parallel Processor

This paper introduces a SIMD (Single Instruction, Multiple Data) parallel processor that efficiently processes the massive data volumes inherent in multimedia. It also implements MMX (MultiMedia eXtension)-type instructions on this data-parallel processor and evaluates and analyzes their performance. The reference data-parallel processor consists of 16 processors, each with a 32-bit datapath. Experimental results for a JPEG compression application on a 1280x1024-pixel image indicate that the MMX-type instructions achieve a 50% performance improvement over the baseline instructions on the same data-parallel architecture. In addition, the MMX-type instructions achieve 100% and 51% improvements over the baseline instructions in energy efficiency and area efficiency, respectively. These results demonstrate that multimedia-specific instructions, including MMX-type instructions, have potential for widely used many-core GPUs (Graphics Processing Units) and other types of parallel processors.
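To illustrate what an MMX-type instruction does on a 32-bit datapath, the following is a minimal sketch that emulates a packed unsigned saturating add over four 8-bit pixels held in one 32-bit word. It is an illustration only: the abstract does not list the processor's actual instructions, so the function names (`packed_add_u8_sat`, `pack4`, `unpack4`) and the choice of a saturating byte add are assumptions modeled on the general MMX style of subword parallelism.

```python
# Hypothetical emulation of one MMX-type operation on a 32-bit word.
# Names and the chosen operation are illustrative assumptions, not the
# instruction set described in the paper.

LANES = 4          # four 8-bit subwords fit in a 32-bit datapath
LANE_BITS = 8
LANE_MASK = 0xFF


def pack4(pixels):
    """Pack four 8-bit values into one 32-bit word (lane 0 = lowest byte)."""
    word = 0
    for lane, p in enumerate(pixels):
        word |= (p & LANE_MASK) << (LANE_BITS * lane)
    return word


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (LANE_BITS * lane)) & LANE_MASK for lane in range(LANES)]


def packed_add_u8_sat(a, b):
    """Add the four 8-bit lanes of a and b independently, clamping at 255.

    On hardware this would be a single instruction; here each lane is
    processed in a loop purely to show the semantics.
    """
    result = 0
    for lane in range(LANES):
        shift = LANE_BITS * lane
        x = (a >> shift) & LANE_MASK
        y = (b >> shift) & LANE_MASK
        s = min(x + y, LANE_MASK)   # saturate instead of wrapping around
        result |= s << shift
    return result


def scalar_add_u8_sat(xs, ys):
    """Baseline: one saturating add per pixel, i.e. four separate operations."""
    return [min(x + y, 255) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    xs = [10, 200, 255, 0]
    ys = [20, 100, 1, 0]
    packed = packed_add_u8_sat(pack4(xs), pack4(ys))
    print(unpack4(packed))            # same values as the scalar baseline
    print(scalar_add_u8_sat(xs, ys))
```

The packed version produces the same per-pixel results as the scalar baseline while touching one word instead of four separate values, which is the kind of saving a subword-parallel instruction targets in pixel-oriented kernels such as those in JPEG compression.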
