Learning Accent Representation with Multi-Level VAE Towards Controllable Speech Synthesis