Learning Fashion Compatibility with Bidirectional LSTMs

The ubiquity of online fashion shopping demands effective recommendation services for customers. In this paper, we study two types of fashion recommendation: (i) suggesting an item that matches existing components in a set to form a stylish outfit (a collection of fashion items), and (ii) generating an outfit with multimodal (images/text) specifications from a user. To this end, we propose to jointly learn a visual-semantic embedding and the compatibility relationships among fashion items in an end-to-end fashion. More specifically, we consider a fashion outfit to be a sequence (usually from top to bottom and then accessories) and each item in the outfit as a time step. Given the fashion items in an outfit, we train a bidirectional LSTM (Bi-LSTM) model to sequentially predict the next item conditioned on previous ones to learn their compatibility relationships. Further, we learn a visual-semantic space by regressing image features to their semantic representations aiming to inject attribute and category information as a regularization for training the LSTM. The trained network can not only perform the aforementioned recommendations effectively but also predict the compatibility of a given outfit. We conduct extensive experiments on our newly collected Polyvore dataset, and the results provide strong qualitative and quantitative evidence that our framework outperforms alternative methods.

[1]  Larry S. Davis,et al.  Automatic Spatially-Aware Fashion Concept Discovery , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[2]  Jiebo Luo,et al.  Mining Fashion Outfit Composition Using an End-to-End Deep Learning Approach on Set Data , 2016, IEEE Transactions on Multimedia.

[3]  Xiaogang Wang,et al.  DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Tinne Tuytelaars,et al.  Modeling Visual Compatibility through Hierarchical Mid-level Elements , 2016, ArXiv.

[5]  Yi Yang,et al.  Uncovering Temporal Context for Video Question and Answering , 2015, ArXiv.

[6]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Anton van den Hengel,et al.  Image-Based Recommendations on Styles and Substitutes , 2015, SIGIR.

[8]  Ranjitha Kumar,et al.  The Elements of Fashion Style , 2016, UIST.

[9]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[11]  Mubarak Shah,et al.  Video Fill in the Blank with Merging LSTMs , 2016, ArXiv.

[12]  Changsheng Xu,et al.  Hi, magic closet, tell me what to wear! , 2012, ACM Multimedia.

[13]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[14]  Luis E. Ortiz,et al.  Retrieving Similar Styles to Parse Clothing , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Chong-Wah Ngo,et al.  Learning Query and Image Similarities with Ranking Canonical Correlation Analysis , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[16]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[17]  Shuicheng Yan,et al.  Clothes Co-Parsing Via Joint Image Segmentation and Labeling With Application to Clothing Retrieval , 2016, IEEE Transactions on Multimedia.

[18]  Alex Graves,et al.  Supervised Sequence Labelling with Recurrent Neural Networks , 2012, Studies in Computational Intelligence.

[19]  Licheng Yu,et al.  Visual Madlibs: Fill in the Blank Description Generation and Question Answering , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20]  Tegan Maharaj,et al.  A Dataset and Exploration of Models for Understanding Video Data through Fill-in-the-Blank Question-Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[22]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Lukás Burget,et al.  Extensions of recurrent neural network language model , 2011, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[24]  Geoffrey E. Hinton,et al.  Speech recognition with deep recurrent neural networks , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Tamara L. Berg,et al.  Paper Doll Parsing: Retrieving Similar Styles to Parse Clothing Items , 2013, 2013 IEEE International Conference on Computer Vision.

[26]  Liang Lin,et al.  Clothing Co-parsing by Joint Image Segmentation and Labeling , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Francesc Moreno-Noguer,et al.  Neuroaesthetics in fashion: Modeling the perception of fashionability , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Svetlana Lazebnik,et al.  Where to Buy It: Matching Street Clothing Photos in Online Shops , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[29]  Matthew J. Hausknecht,et al.  Beyond short snippets: Deep networks for video classification , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30]  Yu Zhou,et al.  Matching User Photos to Online Products with Robust Deep Features , 2016, ICMR.

[31]  Larry S. Davis,et al.  Collaborative Fashion Recommendation: A Functional Tensor Factorization Approach , 2015, ACM Multimedia.

[32]  Serge J. Belongie,et al.  Learning Visual Clothing Style with Heterogeneous Dyadic Co-Occurrences , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[33]  Changsheng Xu,et al.  Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.