Mental Visual Indexing: Towards Fast Video Browsing

Video browsing is an interactive process in which a user searches for a target shot in a long video, so a browsing system must be fast and accurate while demanding minimal user effort. In contrast to traditional Relevance Feedback (RF), we propose a novel paradigm for fast video browsing dubbed Mental Visual Indexing (MVI). At each interactive round, the user only needs to select the displayed shot that is most visually similar to her mental target; this choice then tailors the search further toward the target. Updating the search model after each piece of feedback requires only vector inner products, which makes MVI highly responsive. MVI is underpinned by a Recurrent Neural Network (RNN) sequence model trained on shot sequences that are automatically generated by a rigorous Bayesian framework simulating the user feedback process. Experiments on three 3-hour movies with real users demonstrate the effectiveness of the proposed approach.
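The abstract states that each feedback round updates the search model using only vector inner products. As a minimal sketch of what such a round might look like, the toy function below scores candidate shot embeddings against a query vector by inner product, displays the top-k shots, and folds the user's chosen shot back into the query. All names, the Rocchio-style additive update, and the simulated "mental target" are assumptions for illustration, not the paper's actual model.

```python
import numpy as np

def mvi_round(query, shot_embeddings, k=4, alpha=0.5):
    """One hypothetical MVI interaction round (a sketch, not the paper's method).

    Scores every shot with a single inner product, shows the top-k,
    simulates the user's pick, and nudges the query toward that pick.
    Returns (indices of displayed shots, updated unit-norm query).
    """
    scores = shot_embeddings @ query          # one inner product per shot
    top_k = np.argsort(scores)[::-1][:k]      # shots displayed to the user

    # Simulate the user choosing the displayed shot most similar to a
    # mental target (here arbitrarily fixed as shot 0 for the demo).
    mental_target = shot_embeddings[0]
    sims = shot_embeddings[top_k] @ mental_target
    chosen = top_k[int(np.argmax(sims))]

    # Rocchio-style update (an assumed stand-in for the paper's update rule):
    # move the query toward the chosen shot's embedding.
    new_query = (1 - alpha) * query + alpha * shot_embeddings[chosen]
    return top_k, new_query / np.linalg.norm(new_query)

rng = np.random.default_rng(0)
shots = rng.normal(size=(100, 64))
shots /= np.linalg.norm(shots, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

for _ in range(3):                            # a few interactive rounds
    shown, query = mvi_round(query, shots)
```

Because each round costs one matrix-vector product plus a top-k sort, the per-round latency stays low even for long videos, which is consistent with the responsiveness claim.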
