Evaluation of parallel H.264 decoding strategies for the Cell Broadband Engine

How to develop efficient and scalable parallel applications is the key challenge for emerging many-core architectures. We investigate this question by implementing and comparing two parallel H.264 decoders on the Cell architecture. Future many-cores are expected to use a Cell-like local store memory hierarchy rather than a non-scalable shared memory. The two implemented parallel algorithms, the Task Pool (TP) and the novel Ring-Line (RL) approach, both exploit macroblock-level parallelism. The TP implementation follows the master-slave paradigm and is highly dynamic, so that in theory perfect load balancing can be achieved. The RL approach is distributed and more predictable in the sense that the mapping of macroblocks to processing elements is fixed. This makes it possible to exploit data locality better, to overlap communication with computation, and to reduce communication and synchronization overhead. Although TP is more scalable in theory, RL scales better in practice: using 16 SPEs, RL obtains a scalability of 12x, while TP achieves only 10.3x. More importantly, the absolute performance of RL is much higher. Using 16 SPEs, RL achieves a throughput of 139.6 frames per second (fps), while TP achieves only 76.6 fps. A large part of this additional performance advantage is due to hiding the memory latency. From these results we conclude that, in order to fully leverage the performance of future many-cores, a centralized master should be avoided and the mapping of tasks to cores should be predictable, so that the memory latency can be hidden.
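The following minimal Python sketch illustrates macroblock-level parallelism and a fixed line-to-core mapping, and recomputes the parallel efficiency implied by the reported figures. It is not the authors' implementation. It rests on assumptions not stated in the abstract: that a macroblock depends on its left and upper-right neighbours (the usual H.264 wavefront model), that a 1080p frame is 120 x 68 macroblocks, and that a fixed mapping assigns macroblock line y to core y mod N. All function names are hypothetical.

```python
from collections import Counter


def wavefront_slot(x, y):
    """Earliest time slot for macroblock (x, y), one slot per macroblock.

    Assumes each macroblock waits for its left neighbour (x-1, y) and its
    upper-right neighbour (x+1, y-1), which gives the slot x + 2*y.
    """
    return x + 2 * y


def max_parallel_macroblocks(width_mb, height_mb):
    """Largest number of macroblocks that share one wavefront slot."""
    counts = Counter(
        wavefront_slot(x, y)
        for y in range(height_mb)
        for x in range(width_mb)
    )
    return max(counts.values())


def fixed_line_owner(y, num_cores):
    """Hypothetical fixed mapping: macroblock line y goes to core y mod N."""
    return y % num_cores


def efficiency(scalability, num_cores):
    """Parallel efficiency: achieved scalability divided by core count."""
    return scalability / num_cores


if __name__ == "__main__":
    # Assumed 1080p frame size in macroblocks (1920/16 by 1088/16).
    print("peak parallel macroblocks:", max_parallel_macroblocks(120, 68))
    # Figures reported in the abstract for 16 SPEs.
    print("RL efficiency:", efficiency(12.0, 16))
    print("TP efficiency:", efficiency(10.3, 16))
    print("RL/TP throughput ratio:", 139.6 / 76.6)
```

Under these assumptions, at most 60 macroblocks of a 1080p frame can be processed at once. The reported scalability corresponds to a parallel efficiency of 75% for RL and about 64% for TP on 16 SPEs. Because the owner of each line is known in advance under a fixed mapping, a core could prefetch the data for its next macroblock while decoding the current one.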
