Video2Shop: Exact Matching Clothes in Videos to Online Shopping Images

In recent years, both online retail and video hosting services have grown exponentially. In this paper, a novel deep neural network, called AsymNet, is proposed to explore a new cross-domain task, Video2Shop, which aims to match clothes appearing in videos to exactly the same items in online shops. On the image side, well-established methods are used to detect clothing patches of arbitrary size and extract their features. On the video side, deep visual features are extracted from the detected object regions in each frame and then fed into a Long Short-Term Memory (LSTM) framework for sequence modeling, which captures the temporal dynamics of the video. To perform exact matching between videos and online shopping images, the LSTM hidden states for videos and the features extracted from static images are jointly modeled by a similarity network with a reconfigurable deep tree structure. Moreover, an approximate training method is proposed to make training more efficient. Extensive experiments on a large cross-domain dataset demonstrate the effectiveness and efficiency of the proposed AsymNet, which outperforms state-of-the-art methods.
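The asymmetric pipeline described above (a sequence of LSTM hidden states on the video side, one feature vector on the image side, and a similarity stage that fuses per-frame scores) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TinyLSTM` and `match_score`, the random weights, the feature dimensions, and the single-level softmax fusion (used here in place of the reconfigurable deep tree) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Hypothetical, untrained LSTM over per-frame feature vectors."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.w = {g: random_matrix(hidden_dim, input_dim + hidden_dim, rng) for g in "ifoc"}
        self.b = {g: [0.0] * hidden_dim for g in "ifoc"}

    def run(self, frames):
        """Return one hidden state per frame."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        states = []
        for x in frames:
            z = list(x) + h  # concatenate frame feature and previous hidden state
            pre = {g: [dot(row, z) + bias for row, bias in zip(self.w[g], self.b[g])]
                   for g in "ifoc"}
            i = [sigmoid(v) for v in pre["i"]]
            f = [sigmoid(v) for v in pre["f"]]
            o = [sigmoid(v) for v in pre["o"]]
            cand = [math.tanh(v) for v in pre["c"]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, cand)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            states.append(h)
        return states


def match_score(hidden_states, image_feature, w_sim, w_gate):
    """Fuse per-frame similarities into one video-to-image score.

    Assumption: a flat, mixture-of-experts-style fusion stands in for the
    tree-structured similarity network mentioned in the abstract.
    """
    # Per-frame similarity between a hidden state and the image feature.
    sims = [sigmoid(dot(w_sim, h + list(image_feature))) for h in hidden_states]
    # Softmax gate deciding how much each frame contributes.
    logits = [dot(w_gate, h) for h in hidden_states]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    gates = [e / total for e in exps]
    return sum(g * s for g, s in zip(gates, sims)), gates


def demo(seed=0):
    rng = random.Random(seed)
    frame_dim, hidden_dim, image_dim, num_frames = 6, 4, 5, 3
    frames = [[rng.random() for _ in range(frame_dim)] for _ in range(num_frames)]
    image_feature = [rng.random() for _ in range(image_dim)]
    states = TinyLSTM(frame_dim, hidden_dim, rng).run(frames)
    w_sim = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim + image_dim)]
    w_gate = [rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)]
    return match_score(states, image_feature, w_sim, w_gate)


if __name__ == "__main__":
    score, gates = demo()
    print("match score:", score)
    print("frame weights:", gates)
```

The sketch keeps the two sides asymmetric on purpose: the video contributes several hidden states while the shop image contributes a single vector, and the gate decides which frames matter most for the final score. In a trained system the weights would be learned from matched and unmatched video-image pairs rather than drawn at random.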
