Exploring Visual-Audio Composition Alignment Network for Quality Fashion Retrieval in Video

Fashion retrieval in video suffers from imperfect visual representation and low-quality search results in e-commerce scenarios. Previous works generally focus on searching for identical images from a purely visual perspective, without leveraging multi-modal information for high-quality commodities. As a cross-domain problem, instructional or exhibition audio reveals rich semantic information that facilitates the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. Firstly, we introduce a visual-audio composition module in VACANet that distinguishes attentive and residual entities by learning semantic embeddings from both the visual and audio streams. Secondly, we design a quality alignment training scheme with quality-aware triplet mining and a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval.
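To make the abstract's components concrete, the snippet below is a minimal PyTorch sketch of (i) an audio-gated composition that separates an attentive embedding from a residual one, (ii) a quality-weighted triplet loss, and (iii) a CORAL-style (correlation alignment) domain term between video and shop-image embeddings. The module structure, the dimensions (visual_dim, audio_dim, embed_dim), the gating fusion, and the quality-weighting scheme are illustrative assumptions, not the exact VACANet architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def coral_loss(video_feat, shop_feat):
    """CORAL-style domain alignment: match second-order statistics of
    video-domain and shop-image-domain embeddings (batch x dim tensors)."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    d = video_feat.size(1)
    return ((covariance(video_feat) - covariance(shop_feat)) ** 2).sum() / (4 * d * d)


class VisualAudioComposition(nn.Module):
    """Illustrative composition block: an audio-conditioned gate fuses the two
    streams into an 'attentive' embedding used for retrieval; the part of the
    visual feature not captured by it is kept as the 'residual' entity."""

    def __init__(self, visual_dim=2048, audio_dim=768, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, visual_feat, audio_feat):
        v = self.visual_proj(visual_feat)
        a = self.audio_proj(audio_feat)
        g = self.gate(torch.cat([v, a], dim=-1))  # audio-guided attention gate
        attentive = g * v + (1.0 - g) * a         # attentive entity embedding
        residual = v - attentive                  # residual entity embedding
        return F.normalize(attentive, dim=-1), residual


def quality_aware_triplet_loss(anchor, positive, negative, quality, margin=0.3):
    """Triplet loss whose per-sample term is weighted by a quality score in
    [0, 1], so low-quality video anchors contribute less to training."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return (quality * F.relu(d_pos - d_neg + margin)).mean()
```

In this reading, the overall objective would combine the quality-weighted triplet term over (video anchor, shop positive, shop negative) embeddings with a CORAL penalty between pooled video and shop-image embeddings, e.g. loss = quality_aware_triplet_loss(a, p, n, q) + lambda * coral_loss(video_emb, shop_emb); the weight lambda and the pooling strategy are likewise assumptions.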
