Exploring Visual-Audio Composition Alignment Network for Quality Fashion Retrieval in Video

Fashion retrieval in video suffers from imperfect visual representation and low-quality search results in e-commerce scenarios. Previous works generally focus on searching for identical images from a purely visual perspective, without leveraging multi-modal information for high-quality commodities. As a cross-domain problem, instructional or exhibition audio reveals rich semantic information that facilitates the video-to-shop task. In this paper, we present a novel Visual-Audio Composition Alignment Network (VACANet) for quality fashion retrieval in video. Firstly, we introduce a visual-audio composition module in VACANet that distinguishes attentive and residual entities by learning semantic embeddings from both the visual and audio streams. Secondly, we design a quality alignment training scheme with quality-aware triplet mining and a domain alignment constraint for video-to-image adaptation. Finally, extensive experiments on challenging video datasets demonstrate the scalable effectiveness of our model for quality fashion retrieval.
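To make the abstract's components concrete, the snippet below is a minimal PyTorch sketch of (i) an audio-gated composition that separates an attentive embedding from a residual one, (ii) a quality-weighted triplet loss, and (iii) a CORAL-style (correlation alignment) domain term between video and shop-image embeddings. The module structure, the dimensions (visual_dim, audio_dim, embed_dim), the gating fusion, and the quality-weighting scheme are illustrative assumptions, not the exact VACANet architecture or losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def coral_loss(video_feat, shop_feat):
    """CORAL-style domain alignment: match second-order statistics of
    video-domain and shop-image-domain embeddings (batch x dim tensors)."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)
    d = video_feat.size(1)
    return ((covariance(video_feat) - covariance(shop_feat)) ** 2).sum() / (4 * d * d)


class VisualAudioComposition(nn.Module):
    """Illustrative composition block: an audio-conditioned gate fuses the two
    streams into an 'attentive' embedding used for retrieval; the part of the
    visual feature not captured by it is kept as the 'residual' entity."""

    def __init__(self, visual_dim=2048, audio_dim=768, embed_dim=512):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, embed_dim)
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.gate = nn.Sequential(nn.Linear(2 * embed_dim, embed_dim), nn.Sigmoid())

    def forward(self, visual_feat, audio_feat):
        v = self.visual_proj(visual_feat)
        a = self.audio_proj(audio_feat)
        g = self.gate(torch.cat([v, a], dim=-1))  # audio-guided attention gate
        attentive = g * v + (1.0 - g) * a         # attentive entity embedding
        residual = v - attentive                  # residual entity embedding
        return F.normalize(attentive, dim=-1), residual


def quality_aware_triplet_loss(anchor, positive, negative, quality, margin=0.3):
    """Triplet loss whose per-sample term is weighted by a quality score in
    [0, 1], so low-quality video anchors contribute less to training."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    return (quality * F.relu(d_pos - d_neg + margin)).mean()
```

In this reading, the overall objective would combine the quality-weighted triplet term over (video anchor, shop positive, shop negative) embeddings with a CORAL penalty between pooled video and shop-image embeddings, e.g. loss = quality_aware_triplet_loss(a, p, n, q) + lambda * coral_loss(video_emb, shop_emb); the weight lambda and the pooling strategy are likewise assumptions.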
