Character Grounding and Re-identification in Story of Videos and Text Descriptions

We address character grounding and re-identification in multiple story-based videos like movies and associated text descriptions. In order to solve these related tasks in a mutually rewarding way, we propose a model named Character in Story Identification Network (CiSIN). Our method builds two semantically informative representations via joint training of multiple objectives for character grounding, video/text reidentification and gender prediction: Visual Track Embedding from videos and Textual Character Embedding from text context. These two representations are learned to retain rich semantic multimodal information that enables even simple MLPs to achieve the state-of-the-art performance on the target tasks. More specifically, our CiSIN model achieves the best performance in the Fill-in the Characters task of LSMDC 2019 challenges [35]. Moreover, it outperforms previous state-of-the-art models in M-VAD Names dataset [30] as a benchmark of multimodal character grounding and re-identification.

[1]  Dietrich Paulus,et al.  Simple online and realtime tracking with a deep association metric , 2017, 2017 IEEE International Conference on Image Processing (ICIP).

[2]  Sanja Fidler,et al.  Visual Semantic Search: Retrieving Videos via Complex Textual Queries , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Gunhee Kim,et al.  A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.

[4]  José M. F. Moura,et al.  Visual Coreference Resolution in Visual Dialog using Neural Module Networks , 2018, ECCV.

[5]  Andrew Zisserman,et al.  From Benedict Cumberbatch to Sherlock Holmes: Character Identification in TV series without a Script , 2018, BMVC.

[6]  Qi Tian,et al.  Scalable Person Re-identification: A Benchmark , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[8]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[9]  Xiaogang Wang,et al.  Person Search with Natural Language Description , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Dahua Lin,et al.  Person Search in Videos with One Portrait Through Visual and Temporal Links , 2018, ECCV.

[11]  Qi Tian,et al.  MARS: A Video Benchmark for Large-Scale Person Re-Identification , 2016, ECCV.

[12]  Xingyi Zhou,et al.  Objects as Points , 2019, ArXiv.

[13]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[14]  Andrew Zisserman,et al.  “Who are you?” - Learning person specific classifiers from video , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Seong Joon Oh,et al.  Generating Descriptions with Grounded and Co-referenced People , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Xiaogang Wang,et al.  Identity-Aware Textual-Visual Matching with Latent Co-attention , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[17]  Timothy Dozat,et al.  Universal Dependency Parsing from Scratch , 2019, CoNLL.

[18]  O. Parkhi It ’ s in the bag : Stronger supervision for automated face labelling , 2015 .

[19]  Michael Jones,et al.  An improved deep learning architecture for person re-identification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Leonid Sigal,et al.  Learning Language-Visual Embedding for Movie Understanding with Natural-Language , 2016, ArXiv.

[21]  Cordelia Schmid,et al.  Finding Actors and Actions in Movies , 2013, 2013 IEEE International Conference on Computer Vision.

[22]  Wei Jiang,et al.  Bag of Tricks and a Strong Baseline for Deep Person Re-Identification , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[23]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[24]  Fei-Fei Li,et al.  Linking People in Videos with "Their" Names Using Coreference Resolution , 2014, ECCV.

[25]  Yu Qiao,et al.  Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks , 2016, IEEE Signal Processing Letters.

[26]  Ning Zhang,et al.  Beyond frontal faces: Improving Person Recognition using multiple cues , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Nanning Zheng,et al.  Person Re-identification by Multi-Channel Parts-Based CNN with Improved Triplet Loss Function , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Andrew Zisserman,et al.  Hello! My name is... Buffy'' -- Automatic Naming of Characters in TV Video , 2006, BMVC.

[29]  Bingbing Ni,et al.  Learning Context Graph for Person Search , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Hai Tao,et al.  Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features , 2008, ECCV.

[31]  Xiaogang Wang,et al.  Diversity Regularized Spatiotemporal Attention for Video-Based Person Re-identification , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[32]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Seong Joon Oh,et al.  Person Recognition in Personal Photo Collections , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[34]  Christopher Joseph Pal,et al.  Movie Description , 2016, International Journal of Computer Vision.

[35]  Rita Cucchiara,et al.  M-VAD names: a dataset for video captioning with naming , 2018, Multimedia Tools and Applications.

[36]  Dahua Lin,et al.  Unifying Identification and Context Learning for Person Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[37]  Jianxin Wu,et al.  Person Re-Identification with Correspondence Structure Learning , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[38]  Alessandro Perina,et al.  Person re-identification by symmetry-driven accumulation of local features , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[39]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[40]  Christopher Joseph Pal,et al.  Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research , 2015, ArXiv.

[41]  Alan L. Yuille,et al.  Generation and Comprehension of Unambiguous Object Descriptions , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[43]  Bohyung Han,et al.  Visual Reference Resolution using Attention Memory for Visual Dialog , 2017, NIPS.

[44]  Trevor Darrell,et al.  Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Rainer Stiefelhagen,et al.  “Knock! Knock! Who is it?” probabilistic person identification in TV-series , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Xiaogang Wang,et al.  DeepReID: Deep Filter Pairing Neural Network for Person Re-identification , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[47]  Shiliang Zhang,et al.  Pose-Driven Deep Convolutional Model for Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[48]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[49]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Richard I. Hartley,et al.  Person Reidentification Using Spatiotemporal Appearance , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[51]  Naokazu Yokoya,et al.  Learning Joint Representations of Videos and Sentences with Web Image Search , 2016, ECCV Workshops.

[52]  Q. Tian,et al.  GLAD: Global-Local-Alignment Descriptor for Pedestrian Retrieval , 2017, ACM Multimedia.

[53]  Stefanos Zafeiriou,et al.  ArcFace: Additive Angular Margin Loss for Deep Face Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).