Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning