Simple Dialogue System with AUDITED

We devise a multimodal conversation system for dialogue utterances composed of text, images, or both modalities. We leverage Auxiliary UnsuperviseD vIsual and TExtual Data (AUDITED). To improve performance on the text-based task, we use English-to-French translations of the target sentences as auxiliary supervision. For the image-based task, we employ the DeepFashion dataset, in which we seek nearest-neighbor images of the positive and negative target images of the Multimodal Dialogue Dataset (MMD). These nearest neighbors form a neighbor embedding that provides external context for the target images. We propose two methods for creating neighbor embedding vectors, Neighbor Embedding by Hard Assignment (NEHA) and Neighbor Embedding by Soft Assignment (NESA), which generate context subspaces per target image. These subspaces are then learnt by our pipeline as context for the target data. We also propose a discriminator that switches between the image- and text-based tasks. We show improvements over baselines on the large-scale MMD dataset and on SIMMC.
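The difference between hard and soft assignment can be illustrated with a minimal sketch. The abstract does not give the exact formulation of NEHA or NESA, so everything below is an assumption made for illustration: the function names, the squared Euclidean distance, the softmax weighting, and the `sigma` bandwidth are all hypothetical. Hard assignment here keeps the k nearest auxiliary vectors unchanged, while soft assignment scales each of them by a distance-based weight.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def neighbors_hard(target, bank, k):
    """Hard assignment (assumed form): return the k vectors of the
    auxiliary bank closest to the target, nearest first."""
    return sorted(bank, key=lambda v: sq_dist(target, v))[:k]

def neighbors_soft(target, bank, k, sigma=1.0):
    """Soft assignment (assumed form): scale each of the k nearest
    vectors by a softmax weight over negative squared distances."""
    nearest = neighbors_hard(target, bank, k)
    logits = [-sq_dist(target, v) / (2.0 * sigma ** 2) for v in nearest]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    scaled = [[w * x for x in v] for w, v in zip(weights, nearest)]
    return scaled, weights

# Toy auxiliary bank standing in for DeepFashion image features (hypothetical values).
bank = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
target = [0.1, 0.0]

hard = neighbors_hard(target, bank, k=2)
soft, weights = neighbors_soft(target, bank, k=2)
print(hard)
print(weights)
```

In this sketch the vectors returned for each target image can be read as spanning a small context subspace. How the actual pipeline builds and consumes those subspaces is not specified in the abstract.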
