Dictionary learning in stereo imaging

This paper presents a new method for learning overcomplete dictionaries adapted to efficient joint representation of stereo images. We first formulate a sparse stereo image model where the multi-view correlation is described by local geometric transforms of dictionary atoms in two stereo views. A maximum-likelihood method for learning stereo dictionaries is then proposed, which includes a multi-view geometry constraint in the probabilistic modeling in order to obtain dictionaries optimized for the joint representation of stereo images. The dictionaries are learned by optimizing the maximum-likelihood objective function using the expectation-maximization algorithm. We illustrate the learning algorithm in the case of omnidirectional images, where we learn scales of atoms in a parametric dictionary. The resulting dictionaries provide both better performance in the joint representation of stereo omnidirectional images and improved multi-view feature matching. We finally discuss and demonstrate the benefits of dictionary learning for distributed scene representation and camera pose estimation.

Index Terms: Sparse approximations, dictionary learning, multi-view imaging, omnidirectional cameras.

I. INTRODUCTION

Multiple images of a 3D scene taken from different viewpoints contain information about both 3D structure and texture of the objects in the scene. Therefore, these images give a richer description of the environment compared to a single view. Multi-view images are usually captured by a network of cameras distributed in a 3D scene. Such visual sensor networks can find usage in applications like 3D television, surveillance, robotics or exploration. However, dealing with the high dimensional visual information still poses many challenges, such as multi-view compression, 3D geometry estimation and scene analysis.

Extraction of 3D information from multiple views relies on the theory of multiple view geometry [1], which relates image features that represent the same 3D objects in different views. Pixel-based image representation is used in most of the image-based 3D geometry estimation methods that build dense depth maps by computing pixel correspondences. However, pixel-based representations are highly inefficient for image coding and compression. On the other hand, image representations with orthogonal bases are efficient for compression, but generally fail to efficiently capture the geometry of objects in a scene and the correlation between views. Therefore, multi-view imaging requires new image representation methods that give good performance in both compression and scene geometry estimation.

This paper addresses the problem of learning dictionaries adapted to the representation of multi-view images. We consider sparse image approximations with overcomplete dictionaries of geometrical atoms. As the correlation between multi-view images arises from the geometric constraints on the objects in the scene, it can be simply described by local transforms of geometric atoms [2]. We propose to learn dictionaries that efficiently describe the content of natural images and simultaneously capture the geometric correlation between multi-view images. Dictionary learning for sparse signal representations has become an extremely active area of research in the last few years, when it was realized that adapting the dictionary to a specific task or imposing a certain structure on the dictionary can yield significant improvements of performance in target applications. Researchers have addressed the problem of learning dictionaries for image [3]–[5] and video representation [6]–[8]. To the best of our knowledge, however, there has been no work on learning dictionaries for multi-view representation.

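To make the notion of a sparse approximation over an overcomplete dictionary concrete, the following is a minimal sketch using greedy matching pursuit on a toy 1-D signal. It is an illustration under stated assumptions, not the method of this paper: the Gaussian atom shape, the scale values, the signal length and the helper names `gaussian_dictionary` and `matching_pursuit` are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_dictionary(n, scales, step=1):
    """Build a toy overcomplete dictionary of unit-norm 1-D Gaussian atoms.

    Each atom is a Gaussian bump of a given scale centred at a given
    position. Atom shape and scales are assumptions for illustration.
    """
    t = np.arange(n)
    atoms = []
    for s in scales:
        for c in range(0, n, step):
            g = np.exp(-0.5 * ((t - c) / s) ** 2)
            atoms.append(g / np.linalg.norm(g))
    return np.stack(atoms, axis=1)  # shape: (n, number_of_atoms)

def matching_pursuit(y, D, n_atoms):
    """Greedy matching pursuit over the columns of D.

    At each step, select the atom most correlated with the residual,
    add its projection coefficient, and subtract its contribution.
    """
    residual = y.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    for _ in range(n_atoms):
        corr = D.T @ residual              # inner products with all atoms
        k = int(np.argmax(np.abs(corr)))   # best-matching atom
        coeffs[k] += corr[k]
        residual -= corr[k] * D[:, k]
    return coeffs, residual

n = 64
D = gaussian_dictionary(n, scales=(1.0, 2.0, 4.0))  # 3 * 64 = 192 atoms
# Synthetic signal made of two atoms (indices chosen arbitrarily).
y = 2.0 * D[:, 10] - 1.5 * D[:, 64 + 40]
coeffs, residual = matching_pursuit(y, D, n_atoms=10)
print("nonzero coefficients:", np.count_nonzero(coeffs))
print("relative residual energy:",
      np.linalg.norm(residual) ** 2 / np.linalg.norm(y) ** 2)
```

The dictionary here has three times more atoms than signal samples, which is what makes it overcomplete. Learning the scales of such parametric atoms, rather than fixing them by hand as done above, is the kind of adaptation the paper is concerned with.
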
We concentrate on the problem of two views and develop a maximum likelihood (ML) method for learning dictionaries that lead to improved image approximation under the sparsity prior, and at the same time give better multi-view geometry estimation from sparse low-level visual features. Our method builds upon the ML method for learning overcomplete dictionaries from natural monocular images, introduced by Olshausen and Field [3]. Additionally, the proposed probabilistic approach to learning includes the epipolar geometry in the modeling, and hence matches corresponding atoms within the learning process itself. The optimization problem is cast as an energy minimization problem, which we solve with an Expectation-Maximization (EM) algorithm. The experimental results show the significant benefits of stereo dictionary learning for applications such as distributed scene representation and camera pose recovery.

The organization of this paper is as follows. We first give an overview of the related work on dictionary learning in Section II. The stereo image model is introduced in Section III. Section IV presents the optimization problem for learning dictionaries adapted to stereo images, while its energy minimization solution is given in Section V-B. Experimental results in omnidirectional imaging are presented in Section VI.
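
The epipolar geometry mentioned in the introduction can be illustrated with the standard epipolar constraint for two calibrated views. The sketch below is not the probabilistic model of the paper; it only shows the deterministic constraint that a correct correspondence satisfies. It assumes that points are represented as unit direction vectors (as is natural for omnidirectional cameras) and that the second camera frame is related to the first by X2 = R X1 + t. The pose values, the 3D points and the helper names `skew`, `essential_matrix` and `epipolar_residual` are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_matrix(R, t):
    """Essential matrix E = [t]_x R for the relative pose X2 = R X1 + t."""
    return skew(t) @ R

def epipolar_residual(E, x1, x2):
    """Value of x2^T E x1; zero for a perfect correspondence."""
    return float(x2 @ E @ x1)

# Hypothetical relative pose: 10 degree rotation about z plus a translation.
angle = np.deg2rad(10.0)
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, 0.0, 0.1])
E = essential_matrix(R, t)

# A 3D point seen by both cameras, projected to unit direction vectors.
X1 = np.array([1.0, 2.0, 4.0])
X2 = R @ X1 + t
x1 = X1 / np.linalg.norm(X1)
x2 = X2 / np.linalg.norm(X2)
res_match = epipolar_residual(E, x1, x2)

# A direction from a different 3D point violates the constraint.
X_other = R @ np.array([-2.0, 1.0, 3.0]) + t
x2_wrong = X_other / np.linalg.norm(X_other)
res_mismatch = epipolar_residual(E, x1, x2_wrong)

print("matched pair residual:   ", res_match)
print("mismatched pair residual:", res_mismatch)
```

A residual close to zero indicates that two features are geometrically consistent with the relative pose, which is the kind of information a learning method can exploit when deciding which atoms in the two views correspond to each other.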

[1]  Pascal Frossard, et al.  Conditions for recovery of sparse signals correlated by local transforms, 2009, 2009 IEEE International Symposium on Information Theory.

[2]  I. Tosic, et al.  On unifying sparsity and geometry for image-based 3D scene representation, 2009.

[3]  Pierre Vandergheynst, et al.  Learning sparse generative models of audiovisual signals, 2008, 2008 16th European Signal Processing Conference.

[4]  Pascal Frossard, et al.  Geometry-Based Distributed Scene Representation With Omnidirectional Vision Sensors, 2008, IEEE Transactions on Image Processing.

[5]  Bruno A. Olshausen, et al.  Learning Transformational Invariants from Time-Varying Natural Images, 2008, NIPS 2008.

[6]  Pascal Frossard, et al.  Coarse scene geometry estimation from sparse approximations of multi-view omnidirectional images, 2007, 2007 15th European Signal Processing Conference.

[7]  Bruno A. Olshausen, et al.  Bilinear models of natural images, 2007, Electronic Imaging.

[8]  Kjersti Engan, et al.  Family of iterative LS-based dictionary learning algorithms, ILS-DLA, for sparse signal representation, 2007, Digit. Signal Process.

[9]  Pascal Frossard, et al.  Progressive Coding of 3-D Objects Based on Overcomplete Decompositions, 2006, IEEE Transactions on Circuits and Systems for Video Technology.

[10]  Pierre Vandergheynst, et al.  MoTIF: An Efficient Algorithm for Learning Translation Invariant Dictionaries, 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[11]  Pascal Frossard, et al.  Low-rate and flexible image coding with redundant representations, 2006, IEEE Transactions on Image Processing.

[12]  Pascal Frossard, et al.  Multiresolution motion estimation for omnidirectional images, 2005, 2005 13th European Signal Processing Conference.

[13]  A. Bruckstein, et al.  K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation, 2005.

[14]  Kenji Okajima, et al.  Binocular disparity encoding cells generated through an Infomax based learning algorithm, 2004, Neural Networks.

[15]  Avideh Zakhor, et al.  Dictionary design for matching pursuit and application to motion-compensated video coding, 2004, IEEE Transactions on Circuits and Systems for Video Technology.

[16]  Rémi Gribonval, et al.  Sparse representations in unions of bases, 2003, IEEE Trans. Inf. Theory.

[17]  Bruno A. Olshausen, et al.  Learning sparse, overcomplete representations of time-varying natural images, 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[18]  S. Shankar Sastry, et al.  An Invitation to 3-D Vision: From Images to Geometric Models, 2003.

[19]  Joseph F. Murray, et al.  Dictionary Learning Algorithms for Sparse Representation, 2003, Neural Computation.

[20]  Avideh Zakhor, et al.  Learning dictionaries for matching pursuits based video coders, 2001, Proceedings 2001 International Conference on Image Processing (Cat. No.01CH37205).

[21]  Bernhard P. Wrobel, et al.  Multiple View Geometry in Computer Vision, 2001.

[22]  P O Hoyer, et al.  Independent component analysis applied to feature extraction from colour and stereo images, 2000, Network.

[23]  Høgskolen i Stavanger FRAME DESIGN USING FOCUSS WITH METHOD OF OPTIMAL DIRECTIONS (MOD), 2000.

[24]  Kjersti Engan, et al.  Method of optimal directions for frame design, 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[25]  Geoffrey E. Hinton, et al.  A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, 1998, Learning in Graphical Models.

[26]  David J. Field, et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1?, 1997, Vision Research.

[27]  Bhaskar D. Rao, et al.  Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm, 1997, IEEE Trans. Signal Process.

[28]  Bernice E. Rogowitz, et al.  Conference on Human Vision and Electronic Imaging, 1996.

[29]  David J. Field, et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images, 1996, Nature.

[30]  Stéphane Mallat, et al.  Matching pursuits with time-frequency dictionaries, 1993, IEEE Trans. Signal Process.

[31]  Robert C. Bolles, et al.  Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, 1981, CACM.

[32]  D. Rubin, et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.