Compositional Mixture Representations for Vision and Text