Multi-scale dual-modal generative adversarial networks for text-to-image synthesis