Warehouse-scale video acceleration: co-design and deployment in the wild

Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and, with the slowing of Moore’s law, specialized hardware accelerators to deliver more computing at higher efficiencies. This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design, including a new accelerator building block, the video coding unit (VCU), and discuss key design trade-offs for balanced systems at data center scale and for co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators “in the wild” serving live data center jobs, demonstrating 20-33x improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks, improved failure management, and new workload capabilities not possible with prior systems. To the best of our knowledge, this is the first work to discuss video acceleration at scale in warehouse-scale environments.

Parthasarathy Ranganathan | In Suk Chong | Alex Ramirez | Sandeep Bhatia | Daniel Stodolsky | Andrew C. Walton | Don Stark | Aki Kuusela | Marisabel Guevara | Poonacha Kongetira | Jeremy Dorfman | Ben Gelb | Narayana Penukonda | Amir Salek | Mercedes Tan | Aaron Laursen | Devin Persaud | Rob Springer | Srikanth Muroor | Indira Jayaram | Jeff Calow | Clinton Wills Smullen IV | Raghu Balasubramanian | Prakash Chauhan | Anna Cheung | Niranjani Dasharathi | Jia Feng | Brian Fosco | Samuel Foss | Sara J. Gwin | Yoshiaki Hase | Da-ke He | C. Richard Ho | Roy W. Huffman Jr. | Elisha Indupalli | Cho Mon Kyaw | Yuan Li | Fong Lou | Kyle A. Lucke | JP Maaninen | Ramon Macias | Maire Mahony | David Alexander Munday | Eric Perkins-Argueta | Ville-Mikko Rautio | Yolanda Ripley | Sathish Sekar | Sergey N. Sokolov | Mark S. Wachsler | David A. Wickeraad | Alvin Wijaya | Hon Kwan Wu
