A Highly Scalable Parallel Implementation of H.264

Developing parallel applications that can harness and efficiently use future many-core architectures is the key challenge for scalable computing systems. We contribute to this challenge by presenting a parallel implementation of H.264 that scales to a large number of cores. The algorithm exploits the fact that independent macroblocks (MBs) can be processed in parallel, but whereas a previous approach exploits only intra-frame MB-level parallelism, our algorithm exploits intra-frame as well as inter-frame MB-level parallelism. It is based on the observation that inter-frame dependencies have a limited spatial range. The algorithm has been implemented on a many-core architecture consisting of NXP TriMedia TM3270 embedded processors. This required to develop a subscription mechanism, where MBs are subscribed to the kick-off lists associated with the reference MBs. Extensive simulation results show that the implementation scales very well, achieving a speedup of more than 54 on a 64-core processor, in which case the previous approach achieves a speedup of only 23. Potential drawbacks of the 3D-Wave strategy are that the memory requirements increase since there can be many frames in flight, and that the frame latency might increase. Scheduling policies to address these drawbacks are also presented. The results show that these policies combat memory and latency issues with a negligible effect on the performance scalability. Results analyzing the impact of the memory latency, L1 cache size, and the synchronization and thread management overhead are also presented. Finally, we present performance requirements for entropy (CABAC) decoding.

[1]  Mateo Valero,et al.  Performance evaluation of macroblock-level parallelization of H.264 decoding on a cc-NUMA multiprocessor architecture , 2009 .

[2]  D. Marpe,et al.  Video coding with H.264/AVC: tools, performance, and complexity , 2004, IEEE Circuits and Systems Magazine.

[3]  Thomas Sikora,et al.  Extending Single-View Scalable Video Coding to Multi-View Based on H.264/AVC , 2006, 2006 International Conference on Image Processing.

[4]  Yen-Kuang Chen,et al.  Implementation of H.264 encoder and decoder on personal computers , 2006, J. Vis. Commun. Image Represent..

[5]  B. Juurlink,et al.  Parallel Scalability of H . 264 , 2007 .

[6]  Mateo Valero,et al.  HD-VideoBench. A Benchmark for Evaluating High Definition Digital Video Applications , 2007, 2007 IEEE 10th International Symposium on Workload Characterization.

[7]  M. Kanazawa,et al.  Ultrahigh-Definition Television System , 2004 .

[8]  Andrei Sergeevich Terechko,et al.  A Multithreaded Multicore System for Embedded Media Processing , 2011, Trans. High Perform. Embed. Archit. Compil..

[9]  Javier D. Bruguera,et al.  An FPGA architecture for CABAC decoding in manycore systems , 2008, 2008 International Conference on Application-Specific Systems, Architectures and Processors.

[10]  Stamatis Vassiliadis,et al.  The TM3270 media-processor , 2005, 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05).

[11]  Manuel P. Malumbres,et al.  Hierarchical Parallelization of an H.264/AVC Video Encoder , 2006, International Symposium on Parallel Computing in Electrical Engineering (PARELEC'06).

[12]  E. Salami,et al.  A performance characterization of high definition digital video decoding using H.264/AVC , 2005, IEEE International. 2005 Proceedings of the IEEE Workload Characterization Symposium, 2005..

[13]  Thomas Wiegand,et al.  Draft ITU-T recommendation and final draft international standard of joint video specification , 2003 .

[14]  Ben H. H. Juurlink,et al.  Parallel Scalability of Video Decoders , 2009, J. Signal Process. Syst..

[15]  Erik B. van der Tol,et al.  Mapping of H.264 decoding on a multiprocessor architecture , 2003, IS&T/SPIE Electronic Imaging.