Cross-modal fusion for multi-label image classification with attention mechanism