Detection and Description of Change in Visual Streams

This paper presents a framework for the analysis of changes in visual streams: ordered sequences of images, possibly separated by significant time gaps. We propose a new approach to incorporating unlabeled data into training to generate natural language descriptions of change. We also develop a framework for estimating the time of change in visual stream. We use learned representations for change evidence and consistency of perceived change, and combine these in a regularized graph cut based change detector. Experimental evaluation on visual stream datasets, which we release as part of our contribution, shows that representation learning driven by natural language descriptions significantly improves change detection accuracy, compared to methods that do not rely on language.

[1]  Juan Carlos Niebles,et al.  Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Li Fei-Fei,et al.  CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Nassir Navab,et al.  Towards Unsupervised Image Captioning With Shared Multimodal Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Alessandro Rinaldo,et al.  Optimal nonparametric multivariate change point detection and localization , 2021, IEEE Transactions on Information Theory.

[5]  Jun Yu,et al.  An efficient method for scene cut detection , 2001, Pattern Recognit. Lett..

[6]  Yi Yang,et al.  Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  David S. Matteson,et al.  A Nonparametric Approach for Multiple Change Point Analysis of Multivariate Data , 2013, 1306.4933.

[8]  Harsh Jhamtani,et al.  Learning to Describe Differences Between Pairs of Similar Images , 2018, EMNLP.

[9]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[10]  Yang Feng,et al.  Unsupervised Image Captioning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  S. Domnic,et al.  Video cut detection using block based histogram differences in RGB color space , 2010, 2010 International Conference on Signal and Image Processing.

[12]  Gang Wang,et al.  Unpaired Image Captioning via Scene Graph Alignments , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[13]  Piotr Fryzlewicz,et al.  Multiple‐change‐point detection for high dimensional time series via sparsified binary segmentation , 2015, 1611.08639.

[14]  Jorma Laaksonen,et al.  Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation , 2016, ACM Multimedia.

[15]  Sylvain Arlot,et al.  Consistent change-point detection with kernels , 2016, 1612.04740.

[16]  Rita Cucchiara,et al.  Hierarchical Boundary-Aware Neural Encoder for Video Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[18]  Xiaogang Wang,et al.  Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data , 2018, ECCV.

[19]  Tao Mei,et al.  Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Trevor Darrell,et al.  Generating Visual Explanations , 2016, ECCV.

[21]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Takeo Kanade,et al.  Real-time topometric localization , 2012, 2012 IEEE International Conference on Robotics and Automation.

[24]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[25]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Trevor Darrell,et al.  Robust Change Captioning , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[28]  Trevor Darrell,et al.  Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Zhiguang Qin,et al.  Fully Convolutional CaptionNet: Siamese Difference Captioning Attention Model , 2019, IEEE Access.

[30]  Tae-Hyun Oh,et al.  Image Captioning with Very Scarce Supervised Data: Adversarial Semi-Supervised Learning Approach , 2019, EMNLP/IJCNLP.

[31]  Ronald J. Williams,et al.  Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning , 2004, Machine Learning.

[32]  Gang Wang,et al.  Unpaired Image Captioning by Language Pivoting , 2018, ECCV.

[33]  Peter Young,et al.  From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , 2014, TACL.

[34]  Trevor Darrell,et al.  Captioning Images with Diverse Objects , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Xin Wang,et al.  Video Captioning via Hierarchical Reinforcement Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Luowei Zhou,et al.  End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Sanja Fidler,et al.  Teaching Machines to Describe Images via Natural Language Feedback , 2017, ArXiv.

[38]  Stephen W. Smoliar,et al.  Content based video indexing and retrieval , 1994, IEEE MultiMedia.

[39]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[40]  Michal Irani,et al.  Video indexing based on mosaic representations , 1998, Proc. IEEE.

[41]  Germán Ros,et al.  Street-view change detection with deconvolutional networks , 2016, Autonomous Robots.

[42]  Tengyao Wang,et al.  High dimensional change point estimation via sparse projection , 2016, 1606.06246.

[43]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[44]  Li Fei-Fei,et al.  DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Trevor Darrell,et al.  Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[46]  Pierre Isabelle,et al.  Proceedings of the 40th Annual Meeting on Association for Computational Linguistics , 2002, ACL 2002.

[47]  Kin-Man Lam,et al.  Scene cut detection using the colored pattern appearance model , 2003, Proceedings 2003 International Conference on Image Processing (Cat. No.03CH37429).

[48]  Li Li,et al.  A Survey on Visual Content-Based Video Indexing and Retrieval , 2011, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews).

[49]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[50]  Gregory Shakhnarovich,et al.  Discriminability Objective for Training Descriptive Captions , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[51]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Geoffrey Zweig,et al.  From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.