Efficient Parallel Framework for H.264/AVC Deblocking Filter on Many-Core Platform

The H.264/AVC deblocking filter is becoming the performance bottleneck of H.264/AVC parallelization on many-core platforms. Efficiently parallelizing the deblocking filter on a many-core platform is challenging because the filter has complicated data dependencies, which leave too little parallelism to occupy so many cores. Furthermore, parallelization may incur significant synchronization and load imbalance overhead. At present, research on parallelizing the deblocking filter on many-core platforms is rare and focuses on data-level parallelization. In this paper, we propose a three-step framework that combines task-level segmentation with data-level parallelization to parallelize the deblocking filter efficiently. First, we review the entire deblocking filter process at the 4 × 4 block edge level and divide it into two parts: 1) boundary strength computation (BSC) and 2) edge discrimination and filtering (EDF), which increases the parallelism. Then, we apply the Markov empirical transition probability matrix and Huffman tree (METPMHT) to the BSC, which alleviates the load imbalance problem. Finally, we use independent pixel connected area parallelization (IPCAP) for the EDF, which increases the parallelism and reduces synchronization. In experiments, we apply our parallel method to the deblocking filter of the H.264/AVC reference software JM15.1 on the Tile64 platform without any Tile64 platform-based optimizations. Compared to the well-known 2D-wavefront method, the proposed method achieves average speed-ups of 14.85, 17.83, and 10.60 times for QCIF, CIF, and HD videos, respectively, using 62 cores.
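To illustrate why macroblock-level dependencies leave too little parallelism for a 62-core run, the following minimal sketch models a 2D-wavefront schedule. It assumes, for illustration only, that deblocking a macroblock at column x and row y must wait for its left and upper neighbors, so all macroblocks with the same x + y can run concurrently. The function names and the 16 × 16 macroblock grid sizes (11 × 9 for QCIF, 22 × 18 for CIF) are assumptions made for this example and are not taken from the paper's implementation.

```python
def wavefront_schedule(mb_cols, mb_rows):
    """Group macroblocks into waves under the assumed left/upper dependency.

    Macroblock (x, y) is placed in wave x + y, so every macroblock in a
    wave has its left and upper neighbors in an earlier wave.
    """
    waves = [[] for _ in range(mb_cols + mb_rows - 1)]
    for y in range(mb_rows):
        for x in range(mb_cols):
            waves[x + y].append((x, y))
    return waves


def peak_parallelism(mb_cols, mb_rows):
    """Largest number of macroblocks that can be filtered at the same time."""
    return max(len(wave) for wave in wavefront_schedule(mb_cols, mb_rows))


if __name__ == "__main__":
    # Frame sizes in macroblocks (hypothetical helper data for the example).
    for name, cols, rows in [("QCIF", 11, 9), ("CIF", 22, 18)]:
        print(name, "waves:", len(wavefront_schedule(cols, rows)),
              "peak parallel macroblocks:", peak_parallelism(cols, rows))
```

Under this model the peak number of concurrently processed macroblocks is bounded by the shorter frame dimension in macroblocks, which is well below 62 for QCIF and CIF frames; this is the kind of limit that motivates working at a finer granularity than the macroblock.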
