Weakly-supervised image captioning based on rich contextual information