DISCOV: A Framework for Discovering Objects in Video

This paper presents a probabilistic framework for discovering objects in video. The video can switch between different shots, the unknown objects can leave or enter the scene at multiple times, and the background can be cluttered. The framework consists of an appearance model and a motion model. The appearance model exploits the consistency of object parts in appearance across frames. We use maximally stable extremal regions as observations in the model and hence provide robustness to object variations in scale, lighting and viewpoint. The appearance model provides location and scale estimates of the unknown objects through a compact probabilistic representation. The compact representation contains knowledge of the scene at the object level, thus allowing us to augment it with motion information using a motion model. This framework can be applied to a wide range of different videos and object types, and provides a basis for higher level video content analysis tasks. We present applications of video object discovery to video content analysis problems such as video segmentation and threading, and demonstrate superior performance to methods that exploit global image statistics and frequent itemset data mining techniques.

[1]  Noel A Cressie,et al.  Statistics for Spatial Data. , 1992 .

[2]  A. Murat Tekalp,et al.  Temporal video segmentation using unsupervised clustering and semantic object tracking , 1998, J. Electronic Imaging.

[3]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[4]  Andrew W. Moore,et al.  Variable KD-Tree Algorithms for Spatial Pattern Search and Discovery , 2005, NIPS.

[5]  Nikos Paragios,et al.  Motion-based background subtraction using adaptive kernel density estimation , 2004, CVPR 2004.

[6]  Martial Hebert,et al.  Extracting Scale and Illuminant Invariant Regions through Color , 2006, BMVC.

[7]  Andrew Zisserman,et al.  Video data mining using configurations of viewpoint invariant regions , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[8]  Trevor Darrell,et al.  The pyramid match kernel: discriminative classification with sets of image features , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[9]  L. Davis,et al.  Background and foreground modeling using nonparametric kernel density estimation for visual surveillance , 2002, Proc. IEEE.

[10]  David G. Stork,et al.  Pattern Classification (2nd ed.) , 1999 .

[11]  Andrew Zisserman,et al.  Video Google: a text retrieval approach to object matching in videos , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[12]  Luc Van Gool,et al.  Modeling scenes with local descriptors and latent aspects , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[13]  Shahram Ebadollahi,et al.  Visual Event Detection using Multi-Dimensional Concept Dynamics , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[14]  B. Ripley,et al.  Robust Statistics , 2018, Encyclopedia of Mathematical Geosciences.

[15]  J. Wang,et al.  Improving color based video shot detection , 1999, Proceedings IEEE International Conference on Multimedia Computing and Systems.

[16]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[17]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[18]  Nebojsa Jojic,et al.  LOCUS: learning object classes with unsupervised segmentation , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[19]  Stephen Lin,et al.  Resolution-Enhanced Photometric Stereo , 2006, ECCV.

[20]  M. Leordeanu,et al.  Unsupervised learning of object features from video sequences , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[21]  Dorin Comaniciu,et al.  Real-time tracking of non-rigid objects using mean shift , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[22]  Pietro Perona,et al.  Learning object categories from Google's image search , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[23]  Takeo Kanade,et al.  Robust subspace clustering by combined use of kNND metric and SVD algorithm , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[24]  Chong-Wah Ngo,et al.  Moving-object detection, association, and selection in home videos , 2007, IEEE Transactions on Multimedia.

[25]  Jiri Matas,et al.  Robust wide-baseline stereo from maximally stable extremal regions , 2004, Image Vis. Comput..

[26]  HongJiang Zhang,et al.  Video Snapshot: A Bird View of Video Sequence , 2005, 11th International Multimedia Modelling Conference.

[27]  Seth J. Teller,et al.  Particle Video: Long-Range Motion Estimation Using Point Trajectories , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[28]  Thomas Hofmann,et al.  Unsupervised Learning by Probabilistic Latent Semantic Analysis , 2004, Machine Learning.

[29]  Mike Rees,et al.  5. Statistics for Spatial Data , 1993 .

[30]  W. Eric L. Grimson,et al.  Learning Patterns of Activity Using Real-Time Tracking , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[31]  Tsuhan Chen,et al.  Semantic-Shift for Unsupervised Object Detection , 2006, 2006 Conference on Computer Vision and Pattern Recognition Workshop (CVPRW'06).

[32]  Peter J. Huber,et al.  Robust Statistics , 2005, Wiley Series in Probability and Statistics.

[33]  Stefanos D. Kollias,et al.  An efficient fully unsupervised video object segmentation scheme using an adaptive neural-network classifier architecture , 2003, IEEE Trans. Neural Networks.

[34]  Shih-Fu Chang,et al.  Pattern Mining in Visual Concept Streams , 2006, 2006 IEEE International Conference on Multimedia and Expo.

[35]  Thomas Leung,et al.  Texton Correlation for Recognition , 2004, ECCV.

[36]  Jian Pei,et al.  CLOSET+: searching for the best strategies for mining frequent closed itemsets , 2003, KDD '03.

[37]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[38]  Y. Bar-Shalom Tracking and data association , 1988 .

[39]  P. Anandan,et al.  A Unified Approach to Moving Object Detection in 2D and 3D Scenes , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[40]  David A. Forsyth,et al.  Building models of animals from video , 2006, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[41]  L. Wixson Detecting Salient Motion by Accumulating Directionally-Consistent Flow , 2000, IEEE Trans. Pattern Anal. Mach. Intell..

[42]  Takeo Kanade,et al.  Object Detection Using the Statistics of Parts , 2004, International Journal of Computer Vision.

[43]  Tsuhan Chen,et al.  A Topic-Motion Model for Unsupervised Video Object Discovery , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[44]  Andrew W. Moore,et al.  Detecting Significant Multidimensional Spatial Clusters , 2004, NIPS.

[45]  Alexei A. Efros,et al.  Discovering objects and their location in images , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[46]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.