Combining text and vision in compound semantics: Towards a cognitively plausible multimodal model