Putting It All Together: A Fully Parallel and Efficient H.264 Decoder

In previous chapters we presented efficient and scalable parallelization strategies for individual stages of H.264/AVC decoding. To obtain a fast and scalable parallel decoder, however, all stages need to be parallelized. In this chapter we take the final step in our parallel application design process and bring together what we learned in the previous chapters to realize a highly efficient and scalable parallel application. Specifically, we combine pipeline parallelism with data-level parallelism, in the form of macroblock-level parallelism, to obtain a fully parallel H.264 decoder that is optimized for the core counts of future multicore systems and for emerging video decoding scenarios. The presented implementation is evaluated on a 40-core cc-NUMA system using 1080p25 and 2160p50 video sequences.