CPU Microarchitectural Performance Characterization of Cloud Video Transcoding

Video streaming accounts for more than 75% of all Internet traffic. Videos streamed to end users are encoded to reduce their size so that network bandwidth is used efficiently, and are decoded when played on end users' devices. Videos have to be transcoded (i.e., converted from one encoding format to another) to fit users' differing needs for resolution, frame rate, and encoding format. Global streaming service providers (e.g., YouTube, Netflix, and Facebook) perform a large number of transcoding operations. Optimizing transcoding performance to provide a speedup of even a few percent can save millions of dollars in computational and energy costs. While prior works identified microarchitectural characteristics of the transcoding operation for different classes of videos, other parameters of video transcoding and their impact on CPU performance have yet to be studied. In this work, we investigate the microarchitectural performance of video transcoding with all videos from vbench, a publicly available cloud video benchmark suite. We profile the leading multimedia transcoding software, FFmpeg, with all of its major configurable parameters across videos of different complexity (e.g., videos with high motion and frequent scene transitions are more complex). Based on our profiling results, we find key bottlenecks in the instruction cache, data cache, and branch prediction unit for video transcoding workloads. Moreover, we observe that these bottlenecks vary widely in response to variation in transcoding parameters. We leverage several state-of-the-art compiler approaches to mitigate performance bottlenecks of video transcoding operations. We apply AutoFDO, a feedback-directed optimization (FDO) tool, to improve instruction cache and branch prediction performance. To improve data cache performance, we leverage Graphite, a polyhedral optimizer. Across all videos, AutoFDO and Graphite provide average speedups of 4.66% and 4.42%, respectively.
We also set up simulation settings with different microarchitecture configurations, and explore the potential improvement using a smart scheduler that assigns transcoding tasks to the best-fit configuration based on transcoding parameter values. The smart scheduler performs 3.72% better than the random scheduler and matches the performance of the best scheduler 75% of the time.
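The scheduler comparison summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the configuration names (`large_icache`, `large_dcache`), the choice of preset and CRF as the parameter key, and every timing value are hypothetical and invented for illustration. The sketch assumes the smart scheduler learns, from profiled runs, which configuration is fastest on average for each parameter setting, and it is compared against a random scheduler (expected time under a uniform choice) and a best scheduler (an oracle that always picks the fastest configuration for each task).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical profiling records: ((preset, crf), config, seconds).
# All names and numbers are made up; none come from the paper.
PROFILE = [
    (("veryfast", 23), "large_icache", 10.0),
    (("veryfast", 23), "large_dcache", 12.0),
    (("veryfast", 23), "large_icache", 11.0),
    (("veryfast", 23), "large_dcache", 11.5),
    (("slow", 18), "large_icache", 30.0),
    (("slow", 18), "large_dcache", 26.0),
    (("slow", 18), "large_icache", 31.0),
    (("slow", 18), "large_dcache", 27.0),
]

# Hypothetical held-out tasks: (params, {config: seconds on that config}).
TASKS = [
    (("veryfast", 23), {"large_icache": 9.0, "large_dcache": 10.0}),
    (("slow", 18), {"large_icache": 28.0, "large_dcache": 25.0}),
    (("veryfast", 23), {"large_icache": 12.0, "large_dcache": 11.0}),
    (("slow", 18), {"large_icache": 33.0, "large_dcache": 29.0}),
    (("slow", 18), {"large_icache": 20.0, "large_dcache": 21.0}),
]


def build_policy(profile):
    """Map each parameter setting to the config with the lowest mean profiled time."""
    samples = defaultdict(lambda: defaultdict(list))
    for params, config, seconds in profile:
        samples[params][config].append(seconds)
    return {
        params: min(by_config, key=lambda c: mean(by_config[c]))
        for params, by_config in samples.items()
    }


def evaluate(policy, tasks):
    """Total time per scheduler, plus how often smart matches the oracle choice."""
    smart = oracle = random_expected = 0.0
    matches = 0
    for params, times in tasks:
        chosen = times[policy[params]]      # smart: parameter-based lookup
        best = min(times.values())          # oracle: fastest config for this task
        smart += chosen
        oracle += best
        random_expected += mean(times.values())  # uniform random choice
        matches += chosen == best
    return {
        "smart": smart,
        "oracle": oracle,
        "random": random_expected,
        "match_rate": matches / len(tasks),
    }


policy = build_policy(PROFILE)
result = evaluate(policy, TASKS)
print(policy)
print(result)
```

In this toy data the parameter-based policy beats the random baseline but falls short of the oracle on tasks whose best configuration differs from the per-parameter average, which is the kind of gap the match rate is meant to capture.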
