ACMNet: Adaptive Confidence Matching Network for Human Behavior Analysis via Cross-modal Retrieval

HUI CHEN and GUIGUANG DING, Beijing National Research Center for Information Science and Technology (BNRist); School of Software, Tsinghua University, China ZIJIA LIN, Microsoft Research, China SICHENG ZHAO, University of California, Berkeley, USA XIAOPENG GU, National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data(PSRPC); China Academy of Electronics and Information Technology, China WENYUAN XU, Ubiquitous System Security Lab (USSLab), Zhejiang University, China JUNGONG HAN, University of Warwick, UK

[1]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[2]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[3]  Rodrigo C. Barros,et al.  Bidirectional Retrieval Made Simple , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[4]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[5]  Yan Huang,et al.  Learning Semantic Concepts and Order for Image and Sentence Matching , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Michael S. Bernstein,et al.  Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations , 2016, International Journal of Computer Vision.

[8]  Xuelong Li,et al.  Robust Visual Tracking Using Structurally Random Projection and Weighted Least Squares , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[9]  Byoung-Tak Zhang,et al.  Bilinear Attention Networks , 2018, NeurIPS.

[10]  Qionghai Dai,et al.  DECODE: Deep Confidence Network for Robust Image Classification , 2019, IEEE Transactions on Image Processing.

[11]  Qiang Ni,et al.  Joint Image-Text Hashing for Fast Large-Scale Cross-Media Retrieval Using Self-Supervised Deep Learning , 2019, IEEE Transactions on Industrial Electronics.

[12]  Wenwu Zhu,et al.  Message From the Outgoing Editor-in-Chief , 2020, IEEE Trans. Multim..

[13]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[14]  Tomas Mikolov,et al.  Bag of Tricks for Efficient Text Classification , 2016, EACL.

[15]  Xirong Li,et al.  Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Wei Wang,et al.  Instance-Aware Image and Sentence Matching with Selective Multimodal LSTM , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Yongdong Zhang,et al.  STAT: Spatial-Temporal Attention Mechanism for Video Captioning , 2020, IEEE Transactions on Multimedia.

[18]  Zhedong Zheng,et al.  Dual-path Convolutional Image-Text Embeddings with Instance Loss , 2017, ACM Trans. Multim. Comput. Commun. Appl..

[19]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[20]  Jung-Woo Ha,et al.  Dual Attention Networks for Multimodal Reasoning and Matching , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Rama Chellappa,et al.  Learning Common and Feature-Specific Patterns: A Novel Multiple-Sparse-Representation-Based Tracker , 2018, IEEE Transactions on Image Processing.

[22]  Yoshua Bengio,et al.  Attention-Based Models for Speech Recognition , 2015, NIPS.

[23]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Qionghai Dai,et al.  Cross-Modality Bridging and Knowledge Transferring for Image Understanding , 2019, IEEE Transactions on Multimedia.

[25]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Qingming Huang,et al.  Hedging Deep Features for Visual Tracking , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[27]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[28]  Svetlana Lazebnik,et al.  Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Gang Wang,et al.  Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[31]  Amit K. Roy-Chowdhury,et al.  Webly Supervised Joint Embedding for Cross-Modal Image-Text Retrieval , 2018, ACM Multimedia.

[32]  ZhaoDebin,et al.  Multi-layered gesture recognition with Kinect , 2015 .

[33]  Sanja Fidler,et al.  Order-Embeddings of Images and Language , 2015, ICLR.

[34]  David J. Fleet,et al.  VSE++: Improving Visual-Semantic Embeddings with Hard Negatives , 2017, BMVC.

[35]  Lei Zhang,et al.  Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[37]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[38]  Pong C. Yuen,et al.  Learning Modality-Consistency Feature Templates: A Robust RGB-Infrared Tracking System , 2019, IEEE Transactions on Industrial Electronics.

[39]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[40]  Hui Chen,et al.  Show, Observe and Tell: Attribute-driven Attention Model for Image Captioning , 2018, IJCAI.

[41]  Xi Chen,et al.  Stacked Cross Attention for Image-Text Matching , 2018, ECCV.

[42]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Hui Chen,et al.  GRN: Gated Relation Network to Enhance Convolutional Neural Network for Named Entity Recognition , 2019, AAAI.

[44]  Pong C. Yuen,et al.  Robust Visual Tracking via Basis Matching , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[45]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[46]  Ling Shao,et al.  Unsupervised Deep Video Hashing via Balanced Code for Large-Scale Video Retrieval , 2019, IEEE Transactions on Image Processing.

[47]  Rama Chellappa,et al.  Joint Sparse Representation and Robust Feature-Level Fusion for Multi-Cue Visual Tracking , 2015, IEEE Transactions on Image Processing.

[48]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[49]  Xuelong Li,et al.  A Biologically Inspired Appearance Model for Robust Visual Tracking , 2017, IEEE Transactions on Neural Networks and Learning Systems.

[50]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[51]  Luke S. Zettlemoyer,et al.  Deep Contextualized Word Representations , 2018, NAACL.

[52]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[53]  Luca Antiga,et al.  Automatic differentiation in PyTorch , 2017 .

[54]  Yang Gao,et al.  Multi-layered gesture recognition with Kinect , 2015, J. Mach. Learn. Res..

[55]  Aviv Eisenschtat,et al.  Linking Image and Text with 2-Way Nets , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[57]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Yuxin Peng,et al.  CM-GANs , 2019, ACM Trans. Multim. Comput. Commun. Appl..

[59]  Sam Coope,et al.  Neural Named Entity Recognition Using a Self-Attention Mechanism , 2017, 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI).

[60]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[62]  Gang Hua,et al.  Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[63]  Jungong Han,et al.  Real-Time Scalable Visual Tracking via Quadrangle Kernelized Correlation Filters , 2018, IEEE Transactions on Intelligent Transportation Systems.

[64]  Meng Liu,et al.  Attentive Moment Retrieval in Videos , 2018, SIGIR.