Improving the scalability of multicore systems with a focus on H.264 video decoding

In pursuit of ever-increasing performance, more and more processor architectures have become multicore. Because clock frequency was no longer increasing rapidly and ILP techniques showed diminishing returns, increasing the number of cores per chip was the natural choice. The transistor budget is still growing, and it is expected that within ten years chips can contain hundreds of high-performance cores. Scaling the number of cores, however, does not necessarily translate into an equal scaling of performance. In this thesis, we propose several techniques to improve the performance scalability of multicore systems. With those techniques we address several key challenges of the multicore area. First, we investigate the effect of the power wall on future multicore architectures. Our model includes predictions of technology improvements, an analysis of symmetric and asymmetric multicores, and the influence of Amdahl's Law. Second, we investigate the parallelization of the H.264 video decoding application, thereby addressing application scalability. Existing parallelization strategies are discussed and a novel strategy is proposed. Analysis shows that with the new strategy the amount of available parallelism is on the order of thousands. Several implementations of the strategy are discussed, which show both the difficulty and the possibility of actually exploiting the available parallelism. Third, we propose an Application-Specific Instruction set Processor (ASIP) for H.264 decoding, based on the Cell SPE. ASIPs are energy efficient and allow performance scaling in systems that are limited by the power budget. Finally, we propose hardware support for task management, the benefits of which are twofold. On the one hand, it supports the SARC programming model, a task-based dataflow programming model based on StarSs; by providing hardware support for the most time-consuming part of the runtime system, it improves scalability. On the other hand, it reduces parallelization overhead, such as synchronization, by providing fast hardware primitives.
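
The abstract's point that adding cores does not scale performance equally can be illustrated with the classic form of Amdahl's Law. The sketch below is only an illustration, not the power-constrained model developed in the thesis; the function name `amdahl_speedup` and the sample parallel fractions are assumptions chosen for this example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Classic Amdahl's Law: speedup of a program on `cores` identical cores.

    `parallel_fraction` is the share of single-core execution time that can
    be parallelized perfectly; the remainder runs serially.
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative values only: they are not measurements from the thesis.
    for fraction in (0.90, 0.99, 0.999):
        for cores in (1, 10, 100, 1000):
            speedup = amdahl_speedup(fraction, cores)
            print(f"parallel={fraction:.3f} cores={cores:5d} speedup={speedup:8.2f}")
```

Under this simple model, a program that is 99% parallel can never run more than 100 times faster, however many cores are added, which is why application scalability and low parallelization overhead matter as core counts grow.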
