A survey on multimodal-guided visual content synthesis