A Review of SIMD Multimedia Extensions and their Usage in Scientific and Engineering Applications

The volume and complexity of data processed by today's personal computers are increasing exponentially, placing heavy demands on microprocessors. Meanwhile, the performance gains available from raising a microprocessor's clock speed are approaching physical limits, which makes architectural solutions more important. For this reason, recent microprocessors include an important architectural feature, single instruction multiple data (SIMD) extensions: sets of instructions that can speed up an application by performing a basic operation on multiple data elements in parallel with fewer instructions. The SIMD computational technique was introduced in the IA-32 Intel® architecture with MMX technology and was further enhanced with Intel's introduction of streaming SIMD extensions (SSE), SSE 2 (SSE2) and SSE 3 (SSE3). Although programming with these SIMD extensions enables software to achieve higher performance, several existing scientific applications do not yet take advantage of them. This paper gives an overview of SIMD multimedia extensions. It introduces the features of these extensions and discusses the available methods for programming with multimedia instruction sets. It also reviews recent trends in using multimedia extensions to accelerate multimedia, scientific and engineering applications, and argues for their further use in other significant computationally intensive applications.
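
As a minimal illustration of the idea that one SIMD instruction operates on several data elements at once, the following C sketch adds two arrays of single-precision values four at a time using SSE compiler intrinsics. It is not taken from the paper: the function name `add_floats` and the macro `HAVE_SSE` are hypothetical, and the preprocessor guard simply assumes that the compiler advertises SSE support through its usual predefined macros, falling back to a plain scalar loop otherwise.

```c
#include <stddef.h>

/* Assumption: the compiler signals SSE support with one of these macros. */
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n single-precision floats. */
void add_floats(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    /* Each iteration adds four floats with a single packed addition. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop: leftover elements, or the whole array without SSE. */
    for (; i < n; ++i)
        dst[i] = a[i] + b[i];
}
```

The unaligned load and store variants are used so that the sketch makes no assumption about how the caller allocated the arrays; code that can guarantee 16-byte alignment may prefer the aligned forms.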
