Face-to-Face Contrastive Learning for Social Intelligence Question-Answering

Creating artificial social intelligence, i.e., algorithms that can understand the nuances of multi-person interactions, is an emerging challenge that requires processing facial expressions and gestures from multimodal videos. Recent multimodal methods have set the state of the art on many tasks, but they struggle to model the complex face-to-face conversational dynamics across speaking turns in social interaction, particularly in a self-supervised setup. In this paper, we propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions using factorization nodes that contextualize the multimodal face-to-face interaction along the boundaries of each speaking turn. With the F2F-CL model, we perform contrastive learning between the factorization nodes of different speaking turns within the same video. We experimentally evaluate our method on the challenging Social-IQ dataset and show state-of-the-art results.
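The core training signal described above can be illustrated with a small sketch. The snippet below is a hypothetical, simplified InfoNCE-style contrastive loss over factorization-node embeddings, where nodes belonging to the same speaking turn are treated as positives and nodes from other turns in the same video act as negatives; the function names, the use of NumPy, and the exact pairing scheme are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit sphere before cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def turn_contrastive_loss(node_embs, turn_ids, temperature=0.1):
    """Hypothetical InfoNCE-style loss over factorization-node embeddings.

    node_embs: (N, D) array, one embedding per factorization node.
    turn_ids:  length-N list; nodes sharing a turn id are positives,
               nodes from other speaking turns in the video are negatives.
    """
    z = l2_normalize(np.asarray(node_embs, dtype=np.float64))
    sim = z @ z.T / temperature          # pairwise cosine similarities
    n = len(turn_ids)
    losses = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and turn_ids[j] == turn_ids[i]]
        if not positives:
            continue  # a turn with a single node has no positive pair
        logits = np.delete(sim[i], i)    # drop self-similarity from the denominator
        log_denom = np.log(np.exp(logits).sum())
        for j in positives:
            # -log( exp(sim(i,j)) / sum over all non-self pairs )
            losses.append(-(sim[i, j] - log_denom))
    return float(np.mean(losses))
```

As expected for such a loss, embeddings that align with the speaking-turn structure yield a lower loss than embeddings whose similarity pattern cuts across turn boundaries, which is what drives the representation toward turn-aware structure during training.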
