End-to-End Learning of Compressible Features

Pre-trained convolutional neural networks (CNNs) are powerful off-the-shelf feature generators and have been shown to perform very well on a variety of tasks. Unfortunately, the generated features are high-dimensional and expensive to store: potentially hundreds of thousands of floats per example when processing videos. Traditional entropy-based lossless compression methods are of little help, as they do not yield the desired level of compression, while general-purpose lossy compression methods based on energy compaction (e.g., PCA followed by quantization and entropy coding) are sub-optimal because they are not tuned to the task-specific objective. We propose a learned method that jointly optimizes for compressibility along with the task objective when learning the features. The plug-in nature of our method makes it straightforward to integrate with any target objective and to trade task performance off against compressibility. We present results on multiple benchmarks and demonstrate that our method produces features that are an order of magnitude more compressible, while having a regularization effect that leads to a consistent improvement in accuracy.
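
The joint objective described above can be illustrated with a minimal sketch: a task loss is combined with a weighted rate (compressibility) penalty computed from a learned entropy model over the features, with quantization approximated by additive uniform noise during training. The PyTorch code below is only an illustration under these assumptions, not the authors' implementation; the factorized Gaussian entropy model and all names (e.g., `CompressibleFeatureModel`, `rate_weight`) are hypothetical stand-ins for whatever entropy model and hyperparameters the paper actually uses.

```python
# Minimal sketch (assumptions noted above) of jointly training for task
# accuracy and feature compressibility. Not the paper's exact method.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompressibleFeatureModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                      # any pre-trained CNN trunk
        self.classifier = nn.Linear(feat_dim, num_classes)
        # Per-dimension log-scales of a factorized Gaussian entropy model
        # (a simple stand-in for a learned entropy model).
        self.log_scales = nn.Parameter(torch.zeros(feat_dim))

    def rate(self, z: torch.Tensor) -> torch.Tensor:
        # Expected bits per example under the factorized Gaussian model,
        # evaluated on the (approximately) quantized features.
        scales = self.log_scales.exp()
        nll = 0.5 * (z / scales) ** 2 + self.log_scales + 0.5 * math.log(2 * math.pi)
        return nll.sum(dim=1).mean() / math.log(2.0)  # nats -> bits

    def forward(self, x: torch.Tensor):
        z = self.backbone(x)                          # features to be compressed
        if self.training:
            # Additive uniform noise as a differentiable surrogate for rounding.
            z_hat = z + torch.empty_like(z).uniform_(-0.5, 0.5)
        else:
            z_hat = torch.round(z)                    # hard quantization at test time
        return self.classifier(z_hat), self.rate(z_hat)

def loss_fn(logits, labels, bits, rate_weight=0.01):
    # Joint objective: task loss plus a weighted rate (compressibility) penalty.
    # rate_weight controls the accuracy/compressibility trade-off.
    return F.cross_entropy(logits, labels) + rate_weight * bits
```

In this sketch, sweeping `rate_weight` traces out the trade-off between task accuracy and feature bit-rate; at inference, the rounded features would be entropy-coded under the learned model before storage.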
