Multiple-Clothing Detection and Fashion Landmark Estimation Using a Single-Stage Detector

Fashion image analysis has attracted significant research attention owing to the availability of large-scale fashion datasets with rich annotations. However, existing deep learning models for fashion datasets often have high computational requirements. In this study, we propose a new model suitable for low-power devices. The proposed network is a one-stage detector that rapidly detects multiple clothing items and their landmarks in fashion images. It is designed as a modification of EfficientDet, originally proposed by Google Brain. The proposed network simultaneously trains on the core input features at different resolutions and applies compound scaling to the backbone feature network. The bounding box, class, and landmark prediction networks maintain a balance between speed and accuracy. Moreover, a small number of parameters and a low computational cost make the network efficient. Without image preprocessing, we achieved a mean average precision (mAP) of 0.686 for bounding box detection and 0.450 for landmark estimation on the DeepFashion2 validation dataset, with an inference time of 42 ms. We obtained these results through extensive experiments with different loss functions and optimizers. Furthermore, the proposed method has the advantage of operating on low-power devices.
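
As a rough illustration of the compound scaling mentioned above, the following sketch applies the scaling rules published for the original EfficientDet, where a single coefficient phi grows the input resolution, the BiFPN width and depth, and the prediction-head depth together. The names `ScaledConfig` and `compound_scale` are hypothetical, and treating the landmark head as sharing the depth of the box and class heads is an assumption; the abstract does not state the exact configuration of the proposed network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaledConfig:
    """Sizes derived from one compound coefficient (hypothetical container)."""
    phi: int
    input_resolution: int  # square input side, in pixels
    bifpn_width: int       # channels per BiFPN layer (unrounded rule, see note)
    bifpn_depth: int       # number of BiFPN layers
    head_depth: int        # conv layers per prediction head


def compound_scale(phi: int) -> ScaledConfig:
    """Apply the EfficientDet scaling rules for a coefficient phi >= 0.

    Assumption: the landmark head would reuse `head_depth`, like the
    box and class heads; the abstract does not confirm this.
    """
    if phi < 0:
        raise ValueError("phi must be non-negative")
    return ScaledConfig(
        phi=phi,
        input_resolution=512 + 128 * phi,      # R = 512 + 128 * phi
        bifpn_width=round(64 * 1.35 ** phi),   # W = 64 * 1.35^phi
        bifpn_depth=3 + phi,                   # D_bifpn = 3 + phi
        head_depth=3 + phi // 3,               # D_head = 3 + floor(phi / 3)
    )


if __name__ == "__main__":
    for phi in range(4):
        print(compound_scale(phi))
```

The published EfficientDet configurations round the channel widths to hardware-friendly values, so the widths computed here should be read as the raw rule rather than the exact published numbers.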
