Computer Vision – ECCV 2018

Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences? Currently, a network would first need to execute sometimes hundreds of intermediate layers that specialize in unrelated aspects. Ideally, the more a network already knows about an image, the better it should be at deciding which layer to compute next. In this work, we propose convolutional networks with adaptive inference graphs (ConvNet-AIG) that adaptively define their network topology conditioned on the input image. Following a high-level structure similar to residual networks (ResNets), ConvNet-AIG decides on the fly, for each input image, which layers are needed. In experiments on ImageNet we show that ConvNet-AIG learns distinct inference graphs for different categories. ConvNet-AIG with 50 and with 101 layers each outperform their ResNet counterparts, while using 20% and 33% fewer computations, respectively. By grouping parameters into layers for related classes and only executing relevant layers, ConvNet-AIG improves both efficiency and overall classification quality. Lastly, we also study the effect of adaptive inference graphs on the susceptibility to adversarial examples. We observe that ConvNet-AIG shows higher robustness than ResNets, complementing other known defense mechanisms.
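The idea of deciding per input which residual layers to execute can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the decision is made, so the hard-threshold gate on the feature mean, the toy `scale` transforms, and all names below (`gated_residual_layer`, `run_network`, `mean_above`, `LAYERS`) are assumptions chosen only to show the control flow.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Layer = Tuple[Callable[[Vector], Vector], Callable[[Vector], bool]]


def gated_residual_layer(
    x: Vector,
    transform: Callable[[Vector], Vector],
    gate: Callable[[Vector], bool],
) -> Tuple[Vector, bool]:
    """Return (x + transform(x), True) if the gate fires, else (x, False)."""
    if not gate(x):
        # Layer skipped: transform is never evaluated, which is where
        # the computation would be saved.
        return list(x), False
    fx = transform(x)
    return [a + b for a, b in zip(x, fx)], True


def run_network(x: Vector, layers: List[Layer]) -> Tuple[Vector, List[int]]:
    """Run all layers in order and record which ones were executed."""
    executed: List[int] = []
    for index, (transform, gate) in enumerate(layers):
        x, ran = gated_residual_layer(x, transform, gate)
        if ran:
            executed.append(index)
    return x, executed


def mean_above(threshold: float) -> Callable[[Vector], bool]:
    """Hypothetical gate: fire when the feature mean exceeds a threshold."""
    def gate(x: Vector) -> bool:
        return sum(x) / len(x) > threshold
    return gate


def scale(factor: float) -> Callable[[Vector], Vector]:
    """Hypothetical stand-in for a residual function: scale every feature."""
    def transform(x: Vector) -> Vector:
        return [factor * v for v in x]
    return transform


# Toy three-layer network; thresholds and factors are arbitrary.
LAYERS: List[Layer] = [
    (scale(1.0), mean_above(0.0)),
    (scale(0.5), mean_above(5.0)),
    (scale(-0.5), mean_above(1.0)),
]

if __name__ == "__main__":
    for features in ([1.0, 2.0], [4.0, 8.0], [-1.0, -1.0]):
        output, executed = run_network(features, LAYERS)
        print(features, "->", output, "executed layers:", executed)
```

In this toy setup, different inputs trace different paths through the same stack of layers, which is the sense in which the inference graph depends on the input. A trained model would learn its gates jointly with the layers rather than use fixed thresholds.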
