Faster Neural Networks Straight from JPEG

The simple, elegant approach of training convolutional neural networks (CNNs) directly from RGB pixels has enjoyed overwhelming empirical success. But can more performance be squeezed out of networks by using different input representations? In this paper we propose and explore a simple idea: train CNNs directly on the blockwise discrete cosine transform (DCT) coefficients computed and available in the middle of the JPEG codec. Intuitively, when processing JPEG images using CNNs, it seems unnecessary to decompress a blockwise frequency representation to an expanded pixel representation, shuffle it from CPU to GPU, and then process it with a CNN that will learn something similar to a transform back to frequency representation in its first layers. Why not skip both steps and feed the frequency domain into the network directly? In this paper we modify libjpeg to produce DCT coefficients directly, modify a ResNet-50 network to accommodate the differently sized and strided input, and evaluate performance on ImageNet. We find networks that are both faster and more accurate, as well as networks with about the same accuracy but 1.77x faster than ResNet-50.
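To make the input representation concrete, the sketch below computes the same 8x8 blockwise DCT-II coefficients (with the JPEG level shift) that the codec produces mid-pipeline, and packs them into a spatially downsampled tensor with 64 frequency channels per block, one plausible layout for feeding a CNN. This is a minimal illustration with NumPy and SciPy, not the paper's libjpeg-based implementation; the function names `blockwise_dct` and `dct_tensor` are our own.

```python
import numpy as np
from scipy.fft import dctn


def blockwise_dct(img, block=8):
    """8x8 blockwise 2-D DCT-II coefficients, as in the JPEG codec.

    img: 2-D uint8 array (one channel); dimensions are assumed to be
    divisible by the block size, as they are after JPEG padding.
    """
    x = img.astype(np.float64) - 128.0  # JPEG level shift before the DCT
    h, w = x.shape
    coeffs = np.zeros_like(x)
    for i in range(0, h, block):
        for j in range(0, w, block):
            coeffs[i:i + block, j:j + block] = dctn(
                x[i:i + block, j:j + block], norm='ortho')
    return coeffs


def dct_tensor(img, block=8):
    """Rearrange coefficients into an (h/8, w/8, 64) tensor: one spatial
    position per block, one channel per frequency. This downsampled,
    many-channel layout is why the network's input layers must be resized
    and re-strided relative to a pixel-space ResNet-50."""
    c = blockwise_dct(img, block)
    h, w = c.shape
    return (c.reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3)
             .reshape(h // block, w // block, block * block))
```

A flat gray image (all pixels 128) yields all-zero coefficients after the level shift, which is a quick sanity check on the transform.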
