Watch fashion shows to tell clothing attributes

Abstract In this paper, we propose a novel semi-supervised method to predict clothing attributes with the assistance of unlabeled data like fashion shows. To this end, a two-stage framework is built, i.e., the unsupervised triplet network pre-training stage that ensures frames in the same video having coherent representations while frames from different videos having larger feature distances, and a supervised clothing attribute prediction stage to estimate the value of attributes. Specifically, we first detect the clothes of frames in the collected 18,737 female fashion shows and 21,224 male fashion shows which contain no extra labels. Then a triplet neural network is constructed via embedding the temporal appearance consistency between frames in the same video and the representation gap in different videos. Finally, we transfer the triplet model parameters to multi-task clothing attribute prediction model, and fine-tune it with clothing images holding attribute labels. Extensive experiments demonstrate the advantages of the proposed method on two clothing datasets.

[1]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Alexei A. Efros,et al.  Unsupervised Visual Representation Learning by Context Prediction , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[3]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[4]  Huizhong Chen,et al.  Describing Clothing by Semantic Attributes , 2012, ECCV.

[5]  Koen E. A. van de Sande,et al.  Selective Search for Object Recognition , 2013, International Journal of Computer Vision.

[6]  David G. Lowe,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004, International Journal of Computer Vision.

[7]  Francesc Moreno-Noguer,et al.  A High Performance CRF Model for Clothes Parsing , 2014, ACCV.

[8]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[9]  Francesc Moreno-Noguer,et al.  Neuroaesthetics in fashion: Modeling the perception of fashionability , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Robinson Piramuthu,et al.  Large scale visual recommendations from street fashion images , 2014, KDD.

[11]  Alexander C. Berg,et al.  Runway to Realway: Visual Analysis of Fashion , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[12]  Jia Chen,et al.  Unified Structured Learning for Simultaneous Human Pose Estimation and Garment Attribute Classification , 2014, IEEE Transactions on Image Processing.

[13]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[14]  Hong Yan,et al.  Clothing segmentation using foreground and background estimation based on the constrained Delaunay triangulation , 2008, Pattern Recognit..

[15]  Shuicheng Yan,et al.  Fashion Parsing With Weak Color-Category Labels , 2014, IEEE Transactions on Multimedia.

[16]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[17]  Xiaogang Wang,et al.  Learning from massive noisy labeled data for image classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Shuicheng Yan,et al.  Clothing Attributes Assisted Person Reidentification , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[19]  Subhransu Maji,et al.  Describing people: A poselet-based approach to attribute classification , 2011, 2011 International Conference on Computer Vision.

[20]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  Abhinav Gupta,et al.  Unsupervised Learning of Visual Representations Using Videos , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[23]  Tieniu Tan,et al.  Deep semantic ranking based hashing for multi-label image retrieval , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Qiang Chen,et al.  Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Jian Dong,et al.  Deep Human Parsing with Active Template Regression , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Hanjiang Lai,et al.  Simultaneous feature learning and hash coding with deep neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Jian Sun,et al.  Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition , 2015, IEEE Trans. Pattern Anal. Mach. Intell..

[29]  Jian Dong,et al.  Deep domain adaptation for describing people based on fine-grained clothing attributes , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Changsheng Xu,et al.  Matching-CNN meets KNN: Quasi-parametric human parsing , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Yi Li,et al.  Robot Learning Manipulation Action Plans by "Watching" Unconstrained Videos from the World Wide Web , 2015, AAAI.

[32]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Martial Hebert,et al.  Watch and learn: Semi-supervised learning of object detectors from videos , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Changsheng Xu,et al.  Hi, magic closet, tell me what to wear! , 2012, ACM Multimedia.

[35]  Xiaochun Cao,et al.  Fashion Parsing With Video Context , 2015, IEEE Trans. Multim..

[36]  Tamara L. Berg,et al.  Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items , 2013, 2013 IEEE International Conference on Computer Vision.

[37]  Paolo Favaro,et al.  Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles , 2016, ECCV.

[38]  Hanqing Lu,et al.  Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Xiaogang Wang,et al.  DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Yunchao Wei,et al.  Towards Computational Baby Learning: A Weakly-Supervised Approach for Object Detection , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[41]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[42]  Trevor Darrell,et al.  Caffe: Convolutional Architecture for Fast Feature Embedding , 2014, ACM Multimedia.

[43]  Meng Wang,et al.  Predicting occupation via human clothing and contexts , 2011, 2011 International Conference on Computer Vision.

[44]  Xiaogang Wang,et al.  Fashion Landmark Detection in the Wild , 2016, ECCV.

[45]  Jian-Huang Lai,et al.  Deep Ranking for Person Re-Identification via Joint Representation Learning , 2015, IEEE Transactions on Image Processing.