Object Level Grouping for Video Shots

We describe a method for automatically obtaining object representations suitable for retrieval from generic video shots. The object representation consists of an association of frame regions. These regions provide exemplars of the object’s possible visual appearances.

Two ideas are developed: (i) associating regions within a single shot to represent a deforming object; (ii) associating regions from the multiple visual aspects of a 3D object, thereby implicitly representing 3D structure. For the association we exploit temporal continuity (tracking) and wide baseline matching of affine covariant regions.

In the implementation there are three areas of novelty. First, we describe a method to repair short gaps in tracks. Second, we show how to join tracks across occlusions (where many tracks terminate simultaneously). Third, we develop an affine factorization method that copes with motion degeneracy.

We obtain tracks that last throughout the shot, without requiring a 3D reconstruction. The factorization method is used to associate tracks into object-level groups with common motion. The outcome is that separate parts of an object that are not simultaneously visible (such as the front and back of a car, or the front and side of a face) are associated together. In turn, this enables object-level matching and recognition throughout a video.

We illustrate the method on the feature film “Groundhog Day.” Examples are given for the retrieval of deforming objects (heads, walking people) and rigid objects (vehicles, locations).
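The idea of grouping tracks by common motion can be illustrated with the standard affine rank constraint: under an affine camera, the image tracks of points on one rigidly moving object, once centred, span a subspace of dimension at most three. The sketch below is only an illustration of that constraint, not the factorization method of the paper, which additionally copes with motion degeneracy. The function names, the synthetic data and the track counts are all assumptions made for the example, and it assumes complete, noise-free tracks.

```python
import numpy as np


def subspace_residuals(seed_tracks, tracks, rank=3):
    """RMS distance of each track from the affine subspace fitted to seed tracks.

    Both arrays have shape (2 * n_frames, n_tracks): each column stacks the
    (x, y) image position of one track over all frames.
    """
    # Centring on the seed centroid removes the per-frame translation.
    centroid = seed_tracks.mean(axis=1, keepdims=True)
    u, _, _ = np.linalg.svd(seed_tracks - centroid, full_matrices=False)
    basis = u[:, :rank]  # orthonormal basis of the rank-3 motion subspace
    centred = tracks - centroid
    residual = centred - basis @ (basis.T @ centred)
    return np.sqrt((residual ** 2).mean(axis=0))


def synthetic_tracks(rng, n_frames, n_points):
    """Hypothetical data: one rigid object seen by random affine cameras."""
    points = rng.normal(size=(3, n_points))          # 3D points on the object
    cameras = rng.normal(size=(2 * n_frames, 3))     # stacked 2x3 affine cameras
    offsets = rng.normal(size=(2 * n_frames, 1))     # per-frame translations
    return cameras @ points + offsets


rng = np.random.default_rng(0)
car = synthetic_tracks(rng, n_frames=10, n_points=20)
background = synthetic_tracks(rng, n_frames=10, n_points=20)

# Fit the motion subspace to a few seed tracks from the first object only.
res_car = subspace_residuals(car[:, :8], car)
res_bg = subspace_residuals(car[:, :8], background)
print("largest residual, same object:  ", res_car.max())
print("smallest residual, other object:", res_bg.min())
```

Tracks from the same object have residuals at the level of floating-point error, while tracks from the independently moving object do not, so a threshold on the residual separates the two groups. With real tracks the threshold would need to allow for localisation noise, and missing entries would need separate handling.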
