Optimizing Transcoder Quality Targets Using a Neural Network with an Embedded Bitrate Model

Like all modern internet-based video services, YouTube employs adaptive bitrate (ABR) streaming. Due to the computational expense of transcoding, the goal is to achieve a target bitrate for each ABR segment without requiring multi-pass encoding. We extend the content-dependent model equation between bitrate and frame rate [6] to include CRF and frame size. We then attempt to estimate the content-dependent parameters used in the model equation, using simple summary features taken from the video segment and a novel neural-network layout. We show that we can estimate the correct quality-control parameter on 65% of our test cases without using a previous transcode of the video segment. If a previous transcode of the same segment is available (using an inexpensive configuration), we increase our accuracy to 80%.

Motivation and Background

Like other over-the-top (OTT) media services, YouTube uses ABR streaming to make the most of the available bandwidth at the user’s client player. ABR streaming (e.g., MPEG DASH [9]) allows the client to switch between alternate streams (representations in DASH) which encode the same content at different bitrates. Each representation is encoded such that short-duration temporal segments are independently decodable. The option to switch between the different bitrate streams is therefore available only at the end of these segments. Target bitrates for each representation are chosen so that the user perceives a smoothly varying stream quality as the bitrate varies.

For large OTT sites, transcoding to support these schemes is extremely resource intensive. This is due both to the sheer volume of video that must be transcoded¹ and to the fact that multiple representations must be created from each single input file. A simple codec-agnostic technique for increasing throughput in proportion to available computational resources is to split each input clip into a number of segments which are then encoded in parallel.
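The split-and-encode-in-parallel scheme can be sketched as follows. Here `transcode_segment` is a hypothetical stand-in for a real single-pass encoder invocation, and the four-second default segment length is illustrative only:

```python
from concurrent.futures import ThreadPoolExecutor

def transcode_segment(segment):
    # Hypothetical stand-in for a single-pass encode of one segment.
    # Each call is fully independent: no statistics are shared between
    # jobs, so a failed or slow segment can be retried on its own.
    start, end = segment
    return {"segment": segment, "duration": end - start}

def transcode_clip(clip_duration, segment_len=4.0, workers=8):
    # Split the clip into short, independently decodable segments.
    segments = []
    t = 0.0
    while t < clip_duration:
        segments.append((t, min(t + segment_len, clip_duration)))
        t += segment_len
    # Encode all segments in parallel across the available workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcode_segment, segments))
```

Because the jobs share no state, each one can be restarted after a failure without any of the other transcoded segments being affected.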
For DASH-compliant streams, each encoder operates under the constraint that the bitrate is less than some specified maximum. Unfortunately, this parallel encoding process can result in artifacts that manifest as a large discontinuity between the picture quality at the start and at the end of the segment. The change in picture quality is a result of a single-pass attempt to achieve a given average bitrate over each segment independently. The transcoder is unable to correctly estimate the quality settings for the required bitrate and adjusts the encoding as it goes. As this happens on each segment, the viewer observes a cycle of picture quality from bad to good at intervals equal to the segment duration. The problem is exacerbated when segments are short (on the order of seconds), which is vital for low-latency cloud-based applications like YouTube.

At its core, this problem is due to issues with rate control in the encoding process. It could be mitigated by propagating transcoding statistics (features) for each segment through the system. However, in a parallel encoding system, we wish to minimize or eliminate the need for information to be communicated between processing nodes, to allow wide (and independent) deployment across general-purpose CPU farms. Communication between these separate jobs would greatly complicate their deployment and would increase the impact of isolated transcoding slow-downs or failures. By deploying completely independent transcoding jobs, each job can be restarted without any of the other transcoded segments being affected. Equally important to our effort is to avoid making deep changes in any part of the codec implementation. We need to be able to treat our codec implementation as a commodity, to be replaced and upgraded as better versions or implementations become available.

† Martín Arjovsky is currently at the University of Buenos Aires, Argentina.
¹ At YouTube, 300 hours of video is uploaded every minute of every day [11].
For this reason, our approach sets a single segment-level rate-control parameter. If we were, instead, to modify picture- or macroblock-level quantization processes, our system would require continual maintenance as the deployed codec changed. By remaining at the level of a single external parameter, one that has a clear tie to a basic property of the transcode (the quality/bitrate control), the most that we will need to do upon changing codecs will be to retrain our neural-network-based control process. Within those constraints, our goal is to achieve stable frame quality throughout each segment as well as the desired bitrate for the segment as a whole.

We consider the use of the x264.org codec to produce DASH-compliant streams in a cloud-based environment. x264 is now the most common open-source codec used in the video streaming industry; it has high throughput and is a reference for high performance. Multi-pass constant bitrate encoding would appear to satisfy the quality requirement but, as we have shown in previous work [5], it is only successful if some effort is expended on optimizing codec-specific quality settings at each pass. Given that multi-pass encoding clearly increases the computational resources required to encode a segment, we wish to reduce or eliminate the need for multiple passes. This paper explores the use of a neural network for predicting the parameters of a model that relates bitrate to various video properties. We show that we can estimate the correct codec parameter on 65% of our test cases without using a previous transcode of the video segment.
If there is a previous transcode of the same segment available (using an inexpensive codec configuration), we increase our accuracy to 80%. The next section discusses the key points with respect to rate control for each segment. We then go on to introduce the models used and to present the results of our neural-network training.

©2016 Society for Imaging Science and Technology. DOI: 10.2352/ISSN.2470-1173.2016.2.VIPC-237. IS&T International Symposium on Electronic Imaging 2016: Visual Information Processing and Communication VII, VIPC-237.1.

Figure 1: Distributions of logK estimates, as found by NNLS fit per video block (blue) and by the mid-network layer (red). (mean: 6.15; std: 1.44; min: 0.22; max: 12.84)

Figure 2: Distributions of a estimates, as found by NNLS fit per video block (blue) and by the mid-network layer (red). (mean: 0.126; std: 0.034; min: 0.04; max: 0.257)

Figure 3: Distributions of d estimates, as found by NNLS fit per video block (blue) and by the mid-network layer (red). (mean: 1.57; std: 0.23; min: 0.55; max: 2.65)

Transcoding Configurations and Parameters

We explore resolution-dependent ABR, in which each representation is at a different resolution and a different bitrate. In this formulation, when the client player switches representations, the stream changes resolution and bitrate. The spatial resolution of each video representation is easy to control: we simply tell the transcoder our target resolution. The bitrate of each video segment is more difficult to control in a way that provides the best output quality. This is especially true when doing single-pass compression. In x264 single-pass transcoding, the best (perceptual) video quality (for the bandwidth) is achieved by controlling the quantization levels indirectly, using what is called the Constant Rate Factor (CRF).² Using CRF has the advantage that it adjusts the quantization parameters to take advantage of motion masking.
The general idea is that mistakes are most noticeable in smooth regions with good inter-frame prediction, so CRF rate control spends more of its bits on these regions [7]. Unfortunately, with CRF, there is no direct control of the actual bitrate that is used over the segment of video that we are transcoding. The same CRF parameter settings will yield widely different bitrates when applied to different videos, or even to different segments within a single video. Since we require a single transcode per resolution and a known target bitrate (±20%), we need to estimate the relationship between the bitrate and CRF for each video segment before we start transcoding that segment. In the next section, we discuss a model that relates CRF to bitrate and helps us toward this goal.

Modeling the Effect of Transcoding Parameters on Bitrates

In their 2012 article, Ma et al. [6] found that, for a given segment of video and a fixed frame size, they could accurately predict the bitrate by relating it to the frame rate and the quantization step size using the equation

R(q, t, v) = R_max(q_min, t_max, v) (q / q_min)^(−α(v)) (t / t_max)^(β(v))

where q_min, t_max, and R_max(q_min, t_max, v) are all taken from a previous transcode of the same video material, v, and are the previous transcode’s quantization step size, frame rate, and bitrate, respectively. The values of α(v), β(v), and R_max(q_min, t_max, v) are transcoder and content dependent but independent of the desired quantization step size (q) and frame rate (t). On the test sets used in [6], the values of α(v) and β(v) generally varied by 70% or less, while the value for R_max changed by as much as 13 times, depending on content. We have discovered a similar relationship between bitrate, CRF, frame resolution, and frame rate:

log R(c, t, h, v) = log K(v) − a(v)c + b(v) log t + d(v) log h    (1)

² Throughout this paper, the acronym CRF will be used to refer to this compression parameter and not Conditional Random Field.
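As a concrete illustration of Equation (1), the sketch below fits the content-dependent parameters from a few probe measurements and then inverts the model to choose a CRF for a target bitrate. The per-block fits reported in the figures use non-negative least squares (NNLS); for brevity, plain least squares via the normal equations is shown here, and all measurements are synthetic:

```python
import math

def fit_bitrate_model(samples):
    # Fit (logK, a, b, d) in:  log R = logK - a*c + b*log t + d*log h.
    # `samples` holds (crf, frame_rate, frame_height, bitrate) tuples
    # from trial encodes of the same video segment.
    rows = [[1.0, -c, math.log(t), math.log(h)] for c, t, h, _ in samples]
    y = [math.log(r) for *_, r in samples]
    n = 4
    # Normal equations: (A^T A) x = A^T y.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    aty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(n)]
    # Solve by Gaussian elimination with partial pivoting.
    m = [ata[i] + [aty[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(m[k][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return tuple(x)  # (logK, a, b, d); a comes out positive with this sign convention

def crf_for_target(params, target_bitrate, frame_rate, frame_height):
    # Invert Equation (1) for the CRF c expected to hit `target_bitrate`.
    logK, a, b, d = params
    return (logK + b * math.log(frame_rate) + d * math.log(frame_height)
            - math.log(target_bitrate)) / a
```

The inversion is exact with respect to the fitted model. In the paper’s setting the parameters instead come from a neural network fed with cheap summary features, so no probe encodes of the segment to be transcoded are required.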
Figure 4: Quality of fit of estimated logR(v) using NNLS-fit values for logK(v), a(v), and d(v). (Pearson’s correlation: 0.9984. Error std.: 0.1. Max error: 1.41)

where R(c, t, h, v) is the predicted bitrate for a segment of video v, given the requested CRF setting, c; the requested frame rate, t; the requested frame height, h,³ a