Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval

This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval. This task, which consists in searching a database for content similar to a video right as it is playing, e.g. from a live stream, exhibits challenging characteristics. Only the beginning part of the video is available as query and new frames are constantly added as the video plays out. To perform retrieval in this demanding situation, we propose an approach based on a binary encoder that is both predictive and incremental in order to (1) account for the missing video content at query time and (2) keep up with repeated, continuously evolving queries throughout the streaming. In particular, we present the first hashing framework that infers the unseen future content of a currently playing video. Experiments on FCVID and ActivityNet demonstrate the feasibility of this task. Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task, for instance 7.4% (2.6%) increase at 20% (50%) of elapsed runtime on FCVID using bitcodes of size 192 bits.

[1]  Gang Wang,et al.  Early Action Prediction by Soft Regression , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Lars Petersson,et al.  Encouraging LSTMs to Anticipate Actions Very Early , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[3]  Sridha Sridharan,et al.  Predicting the Future: A Jointly Learnt Model for Action Anticipation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[4]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5]  Nitish Srivastava,et al.  Unsupervised Learning of Video Representations using LSTMs , 2015, ICML.

[6]  Svetlana Lazebnik,et al.  Iterative quantization: A procrustean approach to learning binary codes , 2011, CVPR 2011.

[7]  Ling Shao,et al.  Unsupervised Deep Video Hashing via Balanced Code for Large-Scale Video Retrieval , 2019, IEEE Transactions on Image Processing.

[8]  Shih-Fu Chang,et al.  Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[10]  Jiwen Lu,et al.  Neighborhood Preserving Hashing for Scalable Video Retrieval , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Bernard Ghanem,et al.  ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Dacheng Tao,et al.  DistillHash: Unsupervised Deep Hashing by Distilling Data Pairs , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Zi Huang,et al.  Jointly Modeling Static Visual Appearance and Temporal Pattern for Unsupervised Video Hashing , 2017, CIKM.

[14]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Gaurav Yengera,et al.  Future-State Predicting LSTM for Early Surgery Type Recognition , 2018, IEEE Transactions on Medical Imaging.

[17]  Yutaka Satoh,et al.  Learning Spatio-Temporal Features with 3D Residual Networks for Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[18]  Peisong Wang,et al.  K-Nearest Neighbors Hashing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Yun Fu,et al.  Deep Sequential Context Networks for Action Prediction , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Jianhuang Lai,et al.  Progressive Teacher-Student Learning for Early Action Prediction , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[22]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Meng Wang,et al.  Self-Supervised Video Hashing With Hierarchical Binary Auto-Encoder , 2018, IEEE Transactions on Image Processing.

[24]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[25]  Shih-Fu Chang,et al.  Spherical hashing , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Yi Zhu,et al.  Hidden Two-Stream Convolutional Networks for Action Recognition , 2017, ACCV.

[28]  Meng Wang,et al.  Play and Rewind: Optimizing Binary Representations of Videos by Self-Supervised Temporal Hashing , 2016, ACM Multimedia.

[29]  Ling Shao,et al.  Unsupervised Deep Video Hashing with Balanced Rotation , 2017, IJCAI.

[30]  Limin Wang,et al.  Temporal Segment Networks for Action Recognition in Videos , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[32]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.