Computer Vision – ECCV 2018

Attention mechanisms in biological perception are thought to select subsets of perceptual information for more sophisticated processing that would be prohibitive to perform on all sensory inputs. In computer vision, however, there has been relatively little exploration of hard attention, where some information is selectively ignored, in spite of the success of soft attention, where information is re-weighted and aggregated but never filtered out. Here, we introduce a new approach for hard attention and find that it achieves very competitive performance on recently released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we find that the feature magnitudes correlate with semantic relevance and provide a useful signal for our mechanism's attentional selection criterion. Because hard attention selects only the important features of the input, it can also be more efficient than analogous soft attention mechanisms. This is especially important for recent approaches that use non-local pairwise operations, whose computational and memory costs are quadratic in the size of the set of features.
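
The selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "feature magnitude" means the L2 norm of each feature vector and that a fixed number `k` of vectors is kept. The names `hard_attention_topk` and `pairwise_cost` are hypothetical and used only for illustration.

```python
import numpy as np


def hard_attention_topk(features, k):
    """Keep the k feature vectors with the largest L2 norm (assumed criterion).

    features: array of shape (n, d), one d-dimensional vector per spatial cell.
    Returns (selected, indices); selected has shape (k, d), largest norm first.
    """
    norms = np.linalg.norm(features, axis=1)
    # argsort is ascending, so take the last k entries and reverse them.
    indices = np.argsort(norms)[-k:][::-1]
    return features[indices], indices


def pairwise_cost(n):
    """Number of ordered pairs a non-local pairwise operation visits for n features."""
    return n * n


# Toy example: four 2-D feature vectors with norms 5, 1, 10 and 1.
features = np.array([[3.0, 4.0], [0.0, 1.0], [6.0, 8.0], [1.0, 0.0]])
selected, indices = hard_attention_topk(features, k=2)
```

Unlike soft attention, the discarded vectors never reach later layers. Under these assumptions, keeping 16 of 100 feature vectors would reduce the number of pairs a pairwise operation visits from `pairwise_cost(100)` = 10000 to `pairwise_cost(16)` = 256.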
