Fast and parallel video encoding by workload balancing

Today's video coding/decoding technology covers a wide range of applications, such as phone/conferencing, interactive TV and many audio-video services. Ideally, video coding should be fast enough to offer real-time performance (>24 f/s). However, several coding components are inherently computationally complex, including motion estimation, the discrete cosine transform and variable-length entropy coding, which makes a fast implementation on a parallel computing platform potentially fruitful. Over the years, results have been reported on parallel implementations of MPEG and H.261 encoders, in which spatial or temporal data parallelism is commonly exploited. Most of these methods assign a fixed number of macroblocks (MBs) to each processor in an arbitrary manner. Because the processing delay of an MB varies with its motion content, this approach distributes the workload unevenly across the processors, resulting in a long critical path and poor processor utilization. In this paper, we explore the issue of balancing the MB computing workload across the processors. This involves, first, predicting the workload of the current frame from the workload of the previous frame and, second, scheduling the MBs subject to a locality constraint (Fig. 6). The algorithm was implemented on an IBM SP2, and the results show a reduction in the worst-case delay of around 19-23%, with both the prediction and scheduling overheads taken into account (Fig. 9b). As a result of the shorter critical path, overall processor utilization increased and the overall coding rate improved.
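
The two steps named in the abstract, workload prediction and locality-bounded scheduling, can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the predictor simply reuses each MB's measured cost from the previous frame, and the locality constraint is assumed to mean that each processor receives a contiguous run of MBs in raster order. All function names and the cost values are hypothetical.

```python
from typing import List


def predict_workload(prev_costs: List[int]) -> List[int]:
    """Assumed predictor: an MB costs now what it cost in the previous frame."""
    return list(prev_costs)


def equal_count_split(n_mbs: int, n_procs: int) -> List[range]:
    """Baseline: give every processor (nearly) the same number of MBs."""
    base, extra = divmod(n_mbs, n_procs)
    parts, start = [], 0
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts


def _segments_needed(costs: List[int], cap: int) -> int:
    """Greedy count of contiguous segments whose load never exceeds cap."""
    used, load = 1, 0
    for c in costs:
        if load + c > cap:
            used, load = used + 1, c
        else:
            load += c
    return used


def balanced_split(costs: List[int], n_procs: int) -> List[range]:
    """Contiguous split that minimises the largest predicted per-processor load.

    Contiguity stands in for the locality constraint (an assumption).
    """
    lo, hi = max(costs), sum(costs)
    while lo < hi:  # binary search on the smallest feasible load cap
        mid = (lo + hi) // 2
        if _segments_needed(costs, mid) <= n_procs:
            hi = mid
        else:
            lo = mid + 1
    parts, start, load = [], 0, 0
    for i, c in enumerate(costs):
        if load + c > lo:  # closing the segment keeps its load within the cap
            parts.append(range(start, i))
            start, load = i, 0
        load += c
    parts.append(range(start, len(costs)))
    while len(parts) < n_procs:  # idle processors get an empty range
        parts.append(range(len(costs), len(costs)))
    return parts


def worst_delay(costs: List[int], parts: List[range]) -> int:
    """Critical path: the load of the most heavily loaded processor."""
    return max(sum(costs[i] for i in part) for part in parts)


if __name__ == "__main__":
    # Hypothetical previous-frame costs: two high-motion MBs among quiet ones.
    prev = [1, 1, 1, 1, 8, 8, 1, 1, 1, 1, 1, 1]
    pred = predict_workload(prev)
    print(worst_delay(pred, equal_count_split(len(pred), 4)))  # 17
    print(worst_delay(pred, balanced_split(pred, 4)))          # 8
```

In this toy input the equal-count split places both expensive MBs on one processor, while the balanced split isolates them. The gain is an artifact of the made-up costs and says nothing about the 19-23% figure measured on the IBM SP2.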