Multilabel Image Classification via Feature/Label Co-Projection

This article presents a simple and intuitive solution for multilabel image classification, which achieves competitive performance on the popular COCO and PASCAL VOC benchmarks. The main idea is to mimic how humans perform this task: we recognize both individual labels (i.e., objects and attributes) and the correlations among labels at the same time. Here, label recognition is performed by a standard ConvNet pipeline, whereas label correlation modeling is done by projecting both the labels and the image features extracted by the ConvNet into a common latent vector space. Specifically, we carefully design the loss function to ensure that: 1) labels and features that co-occur frequently are close to each other in the latent space, and 2) conversely, labels and features that do not co-occur are far apart. This information is then combined with the original ConvNet outputs to form the final prediction. The whole model is trained end-to-end, with no supervision beyond the image-level labels. Experiments show that the proposed method consistently outperforms previous approaches on COCO and PASCAL VOC in terms of mAP, macro/micro precision, recall, and $F$-measure. Further, our model is highly efficient at test time, requiring only a small number of additional weights compared to the base model used for direct label recognition.
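To make the co-projection idea concrete, the following is a minimal PyTorch-style sketch of one possible realization: image features and learnable label embeddings are mapped into a shared latent space, a margin-based term pulls co-occurring labels toward the image embedding and pushes absent labels away, and the resulting similarities are combined with the direct classification logits. All names and hyperparameters (`CoProjectionHead`, `feat_dim`, `embed_dim`, `margin`, `alpha`) are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of feature/label co-projection; names and defaults are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoProjectionHead(nn.Module):
    def __init__(self, feat_dim, num_labels, embed_dim=256):
        super().__init__()
        # Projects ConvNet image features into the shared latent space.
        self.feat_proj = nn.Linear(feat_dim, embed_dim)
        # Learnable label embeddings living in the same latent space.
        self.label_embed = nn.Embedding(num_labels, embed_dim)
        # Standard classification head for direct label recognition.
        self.cls = nn.Linear(feat_dim, num_labels)

    def forward(self, feats):
        z = F.normalize(self.feat_proj(feats), dim=-1)       # (B, D) image embedding
        e = F.normalize(self.label_embed.weight, dim=-1)     # (C, D) label embeddings
        sim = z @ e.t()                                       # (B, C) cosine similarities
        logits = self.cls(feats)                              # (B, C) direct recognition logits
        # Final prediction combines direct recognition with label-correlation evidence.
        return logits + sim, sim

def co_projection_loss(final_logits, sim, targets, margin=0.5, alpha=1.0):
    """targets: multi-hot tensor of shape (B, C).
    Pull labels present in the image close to its embedding (similarity -> 1);
    push absent labels to at most `margin` similarity."""
    bce = F.binary_cross_entropy_with_logits(final_logits, targets)
    pos = (1.0 - sim) * targets                 # penalize low similarity for present labels
    neg = F.relu(sim - margin) * (1.0 - targets)  # penalize high similarity for absent labels
    proj = (pos.sum() + neg.sum()) / targets.numel()
    return bce + alpha * proj

# Example usage with hypothetical dimensions (e.g., ResNet features and COCO's 80 labels):
head = CoProjectionHead(feat_dim=2048, num_labels=80)
feats = torch.randn(4, 2048)                    # pooled ConvNet features for a batch
targets = torch.randint(0, 2, (4, 80)).float()  # multi-hot ground-truth labels
final_logits, sim = head(feats)
loss = co_projection_loss(final_logits, sim, targets)
loss.backward()
```

Because the only parameters beyond the base ConvNet classifier are the feature projection and the label embedding table, this kind of head adds relatively little test-time overhead, consistent with the efficiency claim above.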