Fashion Image Retrieval with Capsule Networks

In this study, we investigate the in-shop clothing retrieval performance of densely-connected Capsule Networks with dynamic routing. To this end, we propose a triplet-based design of the Capsule Network architecture with two different feature extraction methods: Stacked-convolutional (SC) and Residual-connected (RC) blocks, which form the input of the capsule layers. Experimental results show that both designs outperform all variants of the baseline, FashionNet, without relying on landmark information. Moreover, compared with state-of-the-art (SOTA) architectures for clothing retrieval, the proposed Triplet Capsule Networks achieve comparable recall rates with only half the parameters used in those architectures.
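As a rough illustration of the two ingredients named above, the sketch below pairs the capsule squashing nonlinearity from dynamic routing with a triplet margin loss on the resulting embeddings. It is a minimal pure-Python sketch, not the architecture evaluated in this study: the function names, the squared Euclidean distance, and the margin value of 0.2 are assumptions made for illustration only.

```python
import math


def squash(s, eps=1e-9):
    """Capsule squashing nonlinearity: keeps the direction of s and maps
    its length into [0, 1). eps avoids division by zero (assumed value)."""
    sq_norm = sum(x * x for x in s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm + eps)
    return [scale * x for x in s]


def embed(capsules):
    """Hypothetical helper: flatten squashed capsule vectors into one embedding."""
    return [x for cap in capsules for x in squash(cap)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pushes the anchor closer to the positive
    than to the negative by at least `margin` (assumed value)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


if __name__ == "__main__":
    # Toy capsule outputs for a query item, the same item in a shop photo,
    # and a different item (made-up numbers).
    query = embed([[3.0, 4.0], [0.5, 0.0]])
    same_item = embed([[3.0, 3.5], [0.4, 0.1]])
    other_item = embed([[-2.0, 1.0], [0.0, 2.0]])
    print(triplet_loss(query, same_item, other_item))
```

In a retrieval setting, the loss is zero once the matching in-shop image is closer to the query than the non-matching one by the margin, so only violating triplets contribute to training.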
