On Learning Where To Look

Current automatic vision systems face two major challenges: scalability and extreme variability of appearance. First, the computational time required to process an image typically scales linearly with the number of pixels in the image, therefore limiting the resolution of input images to thumbnail size. Second, variability in appearance and pose of the objects constitute a major hurdle for robust recognition and detection. In this work, we propose a model that makes baby steps towards addressing these challenges. We describe a learning based method that recognizes objects through a series of glimpses. This system performs an amount of computation that scales with the complexity of the input rather than its number of pixels. Moreover, the proposed method is potentially more robust to changes in appearance since its parameters are learned in a data driven manner. Preliminary experiments on a handwritten dataset of digits demonstrate the computational advantages of this approach.

[1]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[2]  Takeo Kanade,et al.  Neural Network-Based Face Detection , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[3]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[4]  Yann LeCun,et al.  Synergistic Face Detection and Pose Estimation with Energy-Based Models , 2004, J. Mach. Learn. Res..

[5]  Jitendra Malik,et al.  An Information Maximization Model of Eye Movements , 2004, NIPS.

[6]  Cordelia Schmid,et al.  Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[7]  Fu Jie Huang,et al.  A Tutorial on Energy-Based Learning , 2006 .

[8]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[9]  Geoffrey E. Hinton,et al.  Learning to combine foveal glimpses with a third-order Boltzmann machine , 2010, NIPS.

[10]  Garrison W. Cottrell,et al.  Robust classification of objects, faces, and flowers using natural image statistics , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[11]  David A. McAllester,et al.  Cascade object detection with deformable part models , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[12]  Marc'Aurelio Ranzato,et al.  Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition , 2010, ArXiv.

[13]  Eero P. Simoncelli,et al.  Metamers of the ventral stream , 2011, Nature Neuroscience.

[14]  J. Erichsen,et al.  Human and animal vision , 2012 .

[15]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[16]  Lihi Zelnik-Manor,et al.  Context-Aware Saliency Detection , 2012, IEEE Trans. Pattern Anal. Mach. Intell..