On Granularity of Prosodic Representations in Expressive Text-to-Speech