Tracklet Pair Proposal and Context Reasoning for Video Scene Graph Generation

Video scene graph generation (ViDSGG), the creation of video scene graphs that helps in deeper and better visual scene understanding, is a challenging task. Segment-based and sliding-window based methods have been proposed to perform this task. However, they all have certain limitations. This study proposes a novel deep neural network model called VSGG-Net for video scene graph generation. The model uses a sliding window scheme to detect object tracklets of various lengths throughout the entire video. In particular, the proposed model presents a new tracklet pair proposal method that evaluates the relatedness of object tracklet pairs using a pretrained neural network and statistical information. To effectively utilize the spatio-temporal context, low-level visual context reasoning is performed using a spatio-temporal context graph and a graph neural network as well as high-level semantic context reasoning. To improve the detection performance for sparse relationships, the proposed model applies a class weighting technique that adjusts the weight of sparse relationships to a higher level. This study demonstrates the positive effect and high performance of the proposed model through experiments using the benchmark dataset VidOR and VidVRD.

[1]  Stefan Lee,et al.  Graph R-CNN for Scene Graph Generation , 2018, ECCV.

[2]  Yu Cao,et al.  Annotating Objects and Relations in User-Generated Videos , 2019, ICMR.

[3]  Shizhe Chen,et al.  Relation Understanding in Videos , 2019, ACM Multimedia.

[4]  Guanghui Ren,et al.  Video Relation Detection with Trajectory-aware Multi-modal Features , 2020, ACM Multimedia.

[5]  Danfei Xu,et al.  Scene Graph Generation by Iterative Message Passing , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Shiliang Pu,et al.  Video Relation Detection with Spatio-Temporal Graph , 2019, ACM Multimedia.

[7]  Gangshan Wu,et al.  Video Visual Relation Detection via Multi-modal Feature Fusion , 2019, ACM Multimedia.

[8]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[9]  Tat-Seng Chua,et al.  Video Relation Detection via Multiple Hypothesis Association , 2020, ACM Multimedia.

[10]  Dietrich Paulus,et al.  Simple online and realtime tracking with a deep association metric , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[11]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[12]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Tat-Seng Chua,et al.  Video Visual Relation Detection , 2017, ACM Multimedia.

[14]  Ali Farhadi,et al.  Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Bo Dai,et al.  Detecting Visual Relationships with Deep Relational Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Yejin Choi,et al.  Neural Motifs: Scene Graph Parsing with Global Context , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[17]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[18]  Yadong Mu,et al.  Beyond Short-Term Snippet: Video Relation Detection With Spatio-Temporal Global Context , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Pengfei Xu,et al.  A Survey of Scene Graph: Generation and Application , 2020 .

[20]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[21]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.