Cross-modality co-attention networks for visual question answering