Transmitting What Matters: Task-Oriented Video Composition and Compression

We present Transmitting What Matters (TWM), a simple yet effective framework for generating compressed videos that contain only the objects relevant to a specific computer vision task, such as faces for facial expression recognition or license plates for optical character recognition. TWM uses knowledge of the final vision task to compose video frames from only the necessary data. The composed frames are then compressed and can be stored or transmitted to powerful servers, where extensive and time-consuming tasks can be performed. We experimentally evaluate the trade-off between distortion and bitrate over a wide range of compression levels, as well as the impact of compression artifacts on the accuracy of the target vision task. We show that, for one selected computer vision task, the amount of data to be stored or transmitted can be reduced dramatically without compromising accuracy.
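As an illustration only, the composition step and the two quantities in the distortion/bitrate trade-off can be sketched as follows. This is a minimal sketch under stated assumptions, not the TWM implementation: the abstract does not say how frames are composed, so the sketch simply blanks every pixel outside the task-relevant bounding boxes. The function names (`compose_frame`, `psnr`, `bitrate_kbps`), the `(x, y, w, h)` box format, and the choice of PSNR as the distortion measure are all hypothetical.

```python
import math


def compose_frame(frame, boxes, fill=0):
    """Keep only pixels inside the (x, y, w, h) boxes; fill everything else.

    `frame` is a grayscale image given as a list of rows. The boxes are
    assumed to come from some task-specific detector (hypothetical here).
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    out = [[fill] * width for _ in range(height)]
    for x, y, w, h in boxes:
        # Clip each box to the frame so out-of-range boxes are harmless.
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                out[row][col] = frame[row][col]
    return out


def psnr(reference, distorted, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    total = 0
    count = 0
    for ref_row, dis_row in zip(reference, distorted):
        for r, d in zip(ref_row, dis_row):
            total += (r - d) ** 2
            count += 1
    if total == 0:
        return math.inf  # identical frames: no distortion
    mse = total / count
    return 10 * math.log10(peak ** 2 / mse)


def bitrate_kbps(total_bytes, num_frames, fps):
    """Average bitrate in kbit/s of an encoded clip."""
    duration_s = num_frames / fps
    return total_bytes * 8 / duration_s / 1000


# Usage: a 4x4 frame with one 2x2 region of interest at (x=1, y=1).
frame = [[10 * r + c for c in range(4)] for r in range(4)]
composed = compose_frame(frame, [(1, 1, 2, 2)])
```

Blanking the background gives the encoder large uniform areas that cost few bits, which is one plausible way the data reduction described above could arise; the actual composition strategy may differ.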
